My ML Project

Authors
Affiliation

Name I, First Name I

Name of the University

Name II, First Name II

Published

May 13, 2024

Abstract

The following machine learning project focuses on…

1 Introduction

1.1 Overview and Motivation

1.1.1 Context and Background

The Swiss real estate market, characterized by its resilience and complexity, presents a significant opportunity for advanced analytical approaches to understand pricing dynamics. This project, undertaken as part of a Master’s degree in Machine Learning at the University of Lausanne, aims to harness the power of data science to predict real estate market prices in Switzerland. Utilizing contemporary machine learning techniques within this academic framework not only enhances the learning experience but also contributes to a practical understanding of real estate valuation.

As housing prices continue to fluctuate amid economic uncertainties such as interest rate changes and demographic shifts (Credit Suisse), this investigation is not only timely but also of significant importance to potential investors, policymakers, and the academic community.

1.1.2 Aim Of The Investigation

The primary objective of this study is to predict Swiss real estate market prices using real-time data scraped from ImmoScout24, a prominent Swiss real estate website. It addresses two questions:

  • How can machine learning models utilize real-time data scraped from online real estate platforms to predict price trends in the Swiss real estate market?
  • How can machine learning models predict the sale prices of real estate properties in Switzerland based on current market data?

The relevance of this investigation is underpinned by the substantial financial implications of real estate investments: in a market where real estate plays a pivotal role in financial stability and growth, predictive insights benefit buyers and sellers alike.

1.1.3 Description of the Data

The data for this study consists of a meticulously compiled dataset from ImmoScout24, featuring a wide array of variables related to property listings across Switzerland. Fields in the dataset include price, number of rooms, square meters, address, canton, property type, floor, and year of construction. These data points have been gathered through a robust scraping algorithm designed to collect a comprehensive snapshot of the current market. This dataset provides a granular view of the market, essential for training robust machine learning models.

1.1.4 Methodology

This project employs model-based machine learning techniques to quantify the impact of various factors on property prices in Switzerland. The methodology involves training predictive models to interpret the complex relationships within the data, providing a statistical basis for price prediction. This approach allows for an examination of both linear dependencies and more intricate interactions, such as how location and property type combine to influence pricing.
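
As a concrete illustration of this model-based approach, a linear specification with an interaction term can be written in a few lines of R. This is a sketch only, assuming the cleaned dataset properties_filtered produced in Section 2; it is not the final model developed in Section 5:

```r
# Illustrative price model with a location × property-type interaction
# (assumes the cleaned dataset properties_filtered from Section 2)
fit <- lm(price ~ square_meters + number_of_rooms + canton * property_type,
          data = properties_filtered)
summary(fit)  # coefficients estimate each factor's association with price
```

The canton * property_type term lets the effect of property type differ by canton, capturing the kind of interaction between location and property type described above.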

1.1.5 Structure of the Report

The report is structured as follows to provide a coherent narrative and logical flow of analysis:

  • Section 1: Introduction - Outlines the research context, objectives, and significance.
  • Section 2: Data - Details the sources, nature, and preprocessing of the data used.
  • Section 3: Exploratory Data Analysis (EDA) - Analyzes the data to uncover patterns and anomalies.
  • Section 4: Unsupervised Learning - Applies clustering techniques to understand market segmentation.
  • Section 5: Supervised Learning - Discusses the development and validation of predictive models.
  • Section 6: Conclusion - Summarizes the findings, discusses the implications, and suggests areas for further research.

2 Data

This section documents the data sources, describes the datasets, walks through the wrangling and cleaning steps, and flags mistakes, missing data, anomalies, and outliers (some of which are revisited in the EDA).

2.1 Properties Dataset

The dataset is stored in CSV format, comprising real estate listings scraped from the ImmoScout24 platform.

The instances in the dataset represent individual property listings. Each row corresponds to a unique listing on ImmoScout24.

Each listing comprises the following features:

# Create a data frame with the information
property_info <- data.frame(
  Feature = c("Price (CHF)", "Number of Rooms", "Square Meters (m²)", "Address", "Canton", "Property Type", "Floor", "Year of Construction"),
  Description = c("Numerical - Ranges from 25,000 to 26,149,500 CHF.",
                  "Numerical - Values from 1 to 27 rooms, representing the total room count excluding bathrooms and kitchen.",
                  "Numerical - Property sizes range from 1 to 2,700 m².",
                  "Text - Specific address of the property.",
                  "Categorical - Indicates the Swiss canton where the property is located.",
                  "Categorical - Includes types such as Apartment, Single House, Villa, etc.",
                  "Categorical - Indicates whether the property is on the ground floor (eg) or not (noteg).",
                  "Categorical - Grouped into categories such as 0-1919, 1920-1945, ... until 2024")
)

# Create the kable extra table
property_table <- kable(property_info, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))
property_table
Feature Description
Price (CHF) Numerical - Ranges from 25,000 to 26,149,500 CHF.
Number of Rooms Numerical - Values from 1 to 27 rooms, representing the total room count excluding bathrooms and kitchen.
Square Meters (m²) Numerical - Property sizes range from 1 to 2,700 m².
Address Text - Specific address of the property.
Canton Categorical - Indicates the Swiss canton where the property is located.
Property Type Categorical - Includes types such as Apartment, Single House, Villa, etc.
Floor Categorical - Indicates whether the property is on the ground floor (eg) or not (noteg).
Year of Construction Categorical - Grouped into categories such as 0-1919, 1920-1945, ... until 2024

Source - ImmoScout24
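
The documented ranges can be checked programmatically. A minimal sketch, assuming the numeric columns have already been cleaned as in Section 2.1.2 (the raw price column is still a character string before that step):

```r
# Sanity-check the documented ranges on the cleaned data
# (assumes properties_filtered from the cleaning step in Section 2.1.2)
stopifnot(min(properties_filtered$price) >= 25000,
          max(properties_filtered$price) <= 26149500)
range(properties_filtered$number_of_rooms)  # documented: 1 to 27
range(properties_filtered$square_meters)    # documented: 1 to 2700
```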

2.1.1 Raw dataset

Here is the raw dataset:

properties <- read.csv(file.path(here(),"data/properties.csv"))
# Show the first 1,000 rows of the raw listings with reactable
reactable(head(properties, 1000))

# Create a tibble with cantons and observations
observations_table <- tibble(
  Canton = c("Vaud", "Bern", "Lucerne", "Zurich", "Uri", "Schwyz",
             "Obwalden", "Nidwalden", "Glarus", "St. Gallen", "Grisons", 
             "Aargau", "Thurgau", "Ticino", "Valais", "Neuchatel", 
             "Geneva", "Jura", "Zug", "Fribourg", "Solothurn", 
             "Basel-Stadt", "Basel-Landschaft", "Schaffhausen", 
             "Appenzell-Ausser-Rhoden", "Appenzell-Inner-Rhoden", "Total"),
  Observations = c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                   1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                   149, 705, 118, 102, 12, sum(c(3232, 1553, 376, 1191, 71, 93, 29, 51, 55, 757, 405,
                                               1481, 553, 4230, 3601, 513, 629, 329, 69, 1242, 590, 
                                               149, 705, 118, 102, 12)))
)

# Display the table using kable and kableExtra
observations_table %>%
  kbl(caption = "Number of Observations by Canton") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover"))
Number of Observations by Canton
Canton Observations
Vaud 3232
Bern 1553
Lucerne 376
Zurich 1191
Uri 71
Schwyz 93
Obwalden 29
Nidwalden 51
Glarus 55
St. Gallen 757
Grisons 405
Aargau 1481
Thurgau 553
Ticino 4230
Valais 3601
Neuchatel 513
Geneva 629
Jura 329
Zug 69
Fribourg 1242
Solothurn 590
Basel-Stadt 149
Basel-Landschaft 705
Schaffhausen 118
Appenzell-Ausser-Rhoden 102
Appenzell-Inner-Rhoden 12
Total 22136


2.1.2 Wrangling and Cleaning

The data wrangling and cleaning process undertaken for our dataset involves a series of methodical steps aimed at refining the data for accurate analysis and modeling. Initially, the script identifies and addresses problematic values in the number_of_rooms field, where non-numeric entries are detected and marked for further cleaning. The price field is then sanitized to remove any non-numeric characters, ensuring all data in this column represents valid numerical values. A filter is applied to exclude properties priced under 20,000 CHF to avoid anomalies that could skew analysis, likely due to data entry errors or unrealistic listing prices.


Further cleaning standardizes addresses by stripping unwanted characters at the beginning, improving uniformity across this variable. Rows containing any NA values are then removed so that only complete cases remain for analysis. Special attention is given to the square_meters field: non-digit characters are removed and the cleansed string is converted to numeric, after which any NA entries produced by this transformation are discarded.

Coercing square_meters to numeric turns malformed entries into NA; because surface area is a core predictor, the 1,056 affected rows are dropped rather than imputed.

Categorical data in year_category is truncated to maintain consistency and converted into a factor for potential use in modeling. In dealing with number_of_rooms, non-relevant text (like “rooms” or “room”) is removed, and checks discard erroneous entries, such as those mistakenly containing “m²” or reporting more than 100 rooms, likely due to data-entry oversights. An additional corrective step divides the room count by ten for entries above 27, on the assumption that these were misreported because of decimal-placement errors.

Lastly, the script enhances the readability and standardization of the canton field by capitalizing each entry, which is important for categorical consistency across the dataset. These steps ensure the dataset’s integrity and readiness for analytical processes, maintaining a robust, clean, and usable data structure.

# Identify number_of_rooms entries that fail numeric coercion
problematic_values <- properties$number_of_rooms[is.na(as.numeric(properties$number_of_rooms))]
#> Warning: NAs introduced by coercion

# Remove non-numeric characters and convert to numeric
properties$price <- as.numeric(gsub("[^0-9]", "", properties$price))

# Subset the dataset to exclude rows with price < 20000
properties <- properties[properties$price >= 20000, ]


# Strip stray leading characters (punctuation, zeros, dashes) from addresses
properties$address <- gsub("^\\W*[.,0-]\\W*", "", properties$address)

properties_filtered <- na.omit(properties)

# Truncate year_category to its first nine characters (e.g. "1920-1945")
properties_filtered$year_category <- substr(properties_filtered$year_category, 1, 9)
# Convert year_category to a factor for modeling
properties_filtered$year_category <- as.factor(properties_filtered$year_category)


# Strip non-digit characters (e.g. the m² suffix) from square_meters and convert to numeric
properties_filtered$square_meters <- as.numeric(gsub("\\D", "", properties_filtered$square_meters))
# print how many NA observations left in square_meters
print(sum(is.na(properties_filtered$square_meters)))
#> [1] 1056
# remove NA
properties_filtered <- properties_filtered[!is.na(properties_filtered$square_meters),]
# Capitalize canton names for categorical consistency
properties_filtered$canton <- tools::toTitleCase(properties_filtered$canton)

# Preprocess the number_of_rooms column
properties_filtered$number_of_rooms <- gsub(" rooms", "", properties_filtered$number_of_rooms)
properties_filtered$number_of_rooms <- gsub(" room", "", properties_filtered$number_of_rooms)
# Remove rows with "m²" in the "number_of_rooms" column
properties_filtered <- properties_filtered[!grepl("m²", properties_filtered$number_of_rooms), ]
properties_filtered$number_of_rooms <- as.numeric(properties_filtered$number_of_rooms)
# Remove rows with rooms >= 100
properties_filtered <- properties_filtered[properties_filtered$number_of_rooms <= 100, , drop = FALSE]
# Divide cells by 10 if number of rooms is more than 27
properties_filtered$number_of_rooms <- ifelse(properties_filtered$number_of_rooms > 27,
                                               properties_filtered$number_of_rooms / 10,
                                               properties_filtered$number_of_rooms)
# Show the first 100 rows of the cleaned dataset
reactable(head(properties_filtered, 100))


Here we can have a glance at the summary statistics:

# Keep only the numeric columns price, number_of_rooms and square_meters
properties_summary <- properties_filtered[, c('price', 'number_of_rooms', 'square_meters')]
# Summary statistics
summary(properties_summary)
#>      price          number_of_rooms square_meters 
#>  Min.   :   25000   Min.   : 1.00   Min.   :   1  
#>  1st Qu.:  690000   1st Qu.: 3.50   1st Qu.:  99  
#>  Median :  992340   Median : 4.50   Median : 136  
#>  Mean   : 1351000   Mean   : 4.98   Mean   : 160  
#>  3rd Qu.: 1550000   3rd Qu.: 5.50   3rd Qu.: 190  
#>  Max.   :26149500   Max.   :27.00   Max.   :2700
# Build the summary-statistics table directly from the cleaned data
# (the quantiles must be computed on all observations, not on the six
#  numbers returned by summary())
summary_stats <- cbind(
  Min = apply(properties_summary, 2, min),
  Q1 = apply(properties_summary, 2, quantile, probs = 0.25),
  Median = apply(properties_summary, 2, median),
  Mean = apply(properties_summary, 2, mean),
  Q3 = apply(properties_summary, 2, quantile, probs = 0.75),
  Max = apply(properties_summary, 2, max)
)

# Create table
table <- kbl(summary_stats, align = rep('c', 6), caption = "Summary statistics for the dataset") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "hover", "condensed", "responsive"))
table
Summary statistics for the dataset
 Min Q1 Median Mean Q3 Max
price 25000 690000 992340 1351000 1550000 26149500
number_of_rooms 1 3.5 4.5 4.98 5.5 27
square_meters 1 99 136 160 190 2700


The summary statistics show a strongly right-skewed price distribution: the mean (about 1.35 million CHF) lies well above the median (992,340 CHF), pulled up by luxury listings of up to 26,149,500 CHF. Room counts are concentrated between 3.5 and 5.5, while the extremes of square_meters (1 m² and 2,700 m²) point to residual data-entry errors worth revisiting in the exploratory analysis.

2.1.3 AMTOVZ_CSV_LV95 Data Description

The AMTOVZ_CSV_LV95 Data, provided by the Swiss Federal Office of Topography (swisstopo), is an “Official Index of Localities” that encompasses comprehensive geospatial and administrative details of all localities in Switzerland and the Principality of Liechtenstein. It includes critical data such as locality names, postal codes, and geographic perimeters.

This dataset, in CSV format, is regularly updated on a monthly basis, incorporating changes reported by cantonal authorities and Swiss Post. It serves various purposes, including spatial analysis, integration with other geographic datasets, usage as a geolocated background in GIS (Geographic Information Systems) and CAD (Computer-Aided Design) systems, statistical analysis, and as a reference dataset for information systems.

Updates and release notes for the dataset are provided periodically, detailing changes and improvements made over time. The Swiss Federal Office of Topography manages and distributes this dataset as part of its responsibilities in collecting and providing official geospatial data for Switzerland.

location_info <- data.frame(
  Feature = c("Names (Gemeindename)", "Postal Codes (PLZ)", "Canton Codes (Kantonskürzel)", "Longitude (E)", "Latitude (N)"),
  Description = c("Text - Names of the localities.",
                  "Numerical - Postal code of each locality.",
                  "Categorical - Short code representing each canton.",
                  "Numerical - Geographic coordinate specifying the east-west position.",
                  "Numerical - Geographic coordinate specifying the north-south position.")
)

# Create the kable extra table
location_table <- kable(location_info, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))

# Print the table
location_table
Feature Description
Names (Gemeindename) Text - Names of the localities.
Postal Codes (PLZ) Numerical - Postal code of each locality.
Canton Codes (Kantonskürzel) Categorical - Short code representing each canton.
Longitude (E) Numerical - Geographic coordinate specifying the east-west position.
Latitude (N) Numerical - Geographic coordinate specifying the north-south position.

The instances in the dataset represent individual localities within Switzerland and the Principality of Liechtenstein.

Source - swisstopo

2.1.3.1 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95

A new zip_code variable is created by extracting numeric values from the address column of the filtered properties dataset (properties_filtered). This is done using a regular expression that removes non-digit characters, assuming zip codes are embedded within the address string. The resulting zip codes are then cleaned to ensure they are four-digit numbers, removing any leading digits if the extracted value exceeds four digits.

df <- properties_filtered
# The address column looks like '1844 Villeneuve VD' and embeds the zip code,
# which is a four-digit number
df$zip_code <- as.numeric(gsub("\\D", "", df$address))
# Keep only the last four digits when the extracted number is longer
df$zip_code <- ifelse(df$zip_code > 9999, df$zip_code %% 10000, df$zip_code)

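
A quick illustration of the heuristic on three hand-constructed addresses; the third shows its known failure mode, which also explains several of the unmatched zip codes listed further down (e.g. 25 for '1000 Lausanne 25'):

```r
# Illustrative run of the zip-code extraction heuristic
addrs <- c("1844 Villeneuve VD",
           "Hohlenweg 11b, 2500 Biel/Bienne",
           "1000 Lausanne 25")
zips <- as.numeric(gsub("\\D", "", addrs))
zips <- ifelse(zips > 9999, zips %% 10000, zips)
zips
#> [1] 1844 2500   25
```

Street numbers and trailing district numbers are concatenated into the digit string, so the modulo step recovers the correct code only when the zip code occupies the last four digits of that string.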

The properties_filtered dataset is merged with the prepared AMTOVZ_CSV_LV95 data frame on zip_code, enriching each property listing with a locality name, canton code, and geographic coordinates. The check for unmatched rows below shows where the zip-code heuristic fails: addresses with trailing district numbers (e.g. '1000 Lausanne 25', from which the code extracts 25) and postal codes such as 2500 (Biel/Bienne) or 3000 (Bern) that have no entry in the locality index.

# Read the AMTOVZ locality index
amto <- read.csv(file.path(here(),"data/AMTOVZ_CSV_WGS84.csv"), sep = ";")
# Keep 'Gemeindename', 'PLZ', 'Kantonskürzel', 'E' and 'N', renamed to
# 'Community', 'zip_code', 'Canton_code', 'lon' and 'lat'
amto_df <- amto[, c('Gemeindename', 'PLZ', 'Kantonskürzel', 'E', 'N')]
colnames(amto_df) <- c('Community', 'zip_code', 'Canton_code', 'lon', 'lat')

#remove duplicates of zip code
amto_df <- amto_df[!duplicated(amto_df$zip_code),]

# Left-join locality information onto the listings by zip code
df <- merge(df, amto_df, by = "zip_code", all.x = TRUE)

# Inspect listings whose zip code matched no locality
df[is.na(df$Community),]
#>       zip_code    price number_of_rooms square_meters
#> 1           25  2200000            10.0           263
#> 2           25  2200000             6.5           165
#> 3           26   655000             3.5            66
#> 4           26  1995000             7.5           180
#> 5          322   870000             2.5            59
#> 6          322   880000             2.5            55
#> 7          322   975000             3.5            56
#> 230       1014  1510000             5.5           146
#> 1137      1200 16092000             7.0           400
#> 1138      1200   679000             5.5           142
#> 1139      1200  3285450             5.0           230
#> 5481      1919  2558620             5.5           270
#> 5482      1919  1908000             6.5           210
#> 5483      1919  1065000             4.5           130
#> 5484      1919   785000             3.5           103
#> 7624      2500  1100000             5.0           154
#> 7625      2500   872500             4.5           144
#> 7626      2500   420000             4.5           115
#> 7627      2500  1450000             5.5           198
#> 7628      2500   885500             5.5           130
#> 7629      2500   872500             4.5           138
#> 7630      2500   892500             4.5           144
#> 7631      2500   885500             5.5           130
#> 7632      2500   887500             5.5           130
#> 7633      2500  1050000             4.5           121
#> 7634      2500   877500             4.5           138
#> 7635      2500   870500             4.5           125
#> 7636      2500   887500             4.5           144
#> 8328      3000   820000             5.5           165
#> 8329      3000  1140000             3.5           115
#> 8330      3000  1090000             3.5           115
#> 8331      3000  1090000             5.5           193
#> 8332      3000  1090000             5.5           193
#> 8333      3000   720000             3.5           102
#> 8334      3000   920000             4.5           157
#> 8335      3000   920000             4.5           157
#> 8336      3000  1590000             5.5           330
#> 10437     4000   975000             4.5           125
#> 10438     4000   180000             3.0            70
#> 10439     4000  2100000             6.5           360
#> 12362     5201   725000             3.5            95
#> 13215     6000   695000             4.5           133
#> 13968     6511   440000             2.0            64
#> 14244     6547 15000000             7.5           220
#> 14562     6602  2800000             6.5           250
#> 14563     6602  2800000             7.5           242
#> 14564     6602   450000             3.5            75
#> 14565     6602   270000             1.5            28
#> 14566     6604   760000             3.5            78
#> 14567     6604  1990000             4.5           220
#> 14568     6604  2668590             5.5           290
#> 16581     6901  3660930             4.5           290
#> 16582     6901  3660930             4.5           290
#> 16583     6903   790000             3.5           105
#> 16584     6907   995000             4.5           114
#> 16585     6907   995000             4.5           114
#> 16586     6911   469350             5.5           140
#> 16587     6911   737550             4.5            82
#> 16588     6911   660000             7.5           200
#> 16589     6911   610000             3.5           103
#> 17900     7133  2266290             5.5           160
#> 17909     7135  2690000             8.5           236
#> 18169     8000  2100000             4.5           152
#> 18170     8000  1650000             4.5           142
#> 18171     8000   925000             3.5           102
#> 18172     8000  1650000             4.5           142
#> 18173     8000  1150000             4.5           128
#> 18174     8000  1450000             5.5           143
#> 18175     8000  1990000             5.5           200
#> 18176     8000   975000             4.5           122
#> 18177     8000  1990000             5.5           200
#> 18178     8000  2495000             5.5           482
#> 18658     8238   245000             2.0            49
#> 19082     8423  2110000             6.5           204
#> 19083     8423  2190000             5.5           167
#> 20296     9241   545000             4.5           100
#> 20297     9241   730840             5.5           130
#>                                                  address
#> 1                                       1000 Lausanne 25
#> 2                                       1000 Lausanne 25
#> 3                                       1000 Lausanne 26
#> 4                          Lausanne 26, 1000 Lausanne 26
#> 5                    Via Cuolm Liung 30d, 7032 Laax GR 2
#> 6                                         7032 Laax GR 2
#> 7                       Via Murschetg 29, 7032 Laax GR 2
#> 230                                        1014 Lausanne
#> 1137                                         1200 Genève
#> 1138  Chemin des pralets, 74100 Etrembières, 1200 Genève
#> 1139                                         1200 Genève
#> 5481                                       1919 Martigny
#> 5482                                       1919 Martigny
#> 5483                                       1919 Martigny
#> 5484                                       1919 Martigny
#> 7624                                    2500 Biel/Bienne
#> 7625                                    2500 Biel/Bienne
#> 7626                                    2500 Biel/Bienne
#> 7627                                         2500 Bienne
#> 7628                                    2500 Biel/Bienne
#> 7629                                    2500 Biel/Bienne
#> 7630                                    2500 Biel/Bienne
#> 7631                                    2500 Biel/Bienne
#> 7632                                    2500 Biel/Bienne
#> 7633                     Hohlenweg 11b, 2500 Biel/Bienne
#> 7634                                    2500 Biel/Bienne
#> 7635                                    2500 Biel/Bienne
#> 7636                                    2500 Biel/Bienne
#> 8328                                           3000 Bern
#> 8329                                           3000 Bern
#> 8330                                           3000 Bern
#> 8331                                           3000 Bern
#> 8332                                           3000 Bern
#> 8333                                           3000 Bern
#> 8334                                           3000 Bern
#> 8335                                           3000 Bern
#> 8336                                           3000 Bern
#> 10437                                         4000 Basel
#> 10438           Lörrach Brombach Steinsack 6, 4000 Basel
#> 10439                                         4000 Basel
#> 12362                                      5201 Brugg AG
#> 13215   in TRIENGEN, ca. 20 min. bei Luzern, 6000 Luzern
#> 13968                                     6511 Cadenazzo
#> 14244                               Augio 1F, 6547 Augio
#> 14562                                       6602 Muralto
#> 14563                                       6602 Muralto
#> 14564                      Via Bacilieri 2, 6602 Muralto
#> 14565                                       6602 Muralto
#> 14566                                       6604 Locarno
#> 14567                                       6604 Solduno
#> 14568                                       6604 Solduno
#> 16581                                        6901 Lugano
#> 16582                                        6901 Lugano
#> 16583                                        6903 Lugano
#> 16584                                      6907 MASSAGNO
#> 16585                                      6907 MASSAGNO
#> 16586                             6911 Campione d'Italia
#> 16587                             6911 Campione d'Italia
#> 16588                             6911 Campione d'Italia
#> 16589                             6911 Campione d'Italia
#> 17900                  Inder Platenga 34, 7133 Obersaxen
#> 17909                                       7135 Fideris
#> 18169                                        8000 Zürich
#> 18170                                        8000 Zürich
#> 18171                                        8000 Zürich
#> 18172                                        8000 Zürich
#> 18173                                        8000 Zürich
#> 18174                                        8000 Zürich
#> 18175                                        8000 Zürich
#> 18176                                        8000 Zürich
#> 18177                                        8000 Zürich
#> 18178                                        8000 Zürich
#> 18658      Stemmerstrasse 14, 8238 Büsingen am Hochrhein
#> 19082                      Chüngstrasse 60, 8423 Embrach
#> 19083                      Chüngstrasse 48, 8423 Embrach
#> 20296                                       9241 Kradolf
#> 20297                                       9241 Kradolf
#>             canton    property_type floor year_category Community
#> 1             Vaud     Single house           1919-1945      <NA>
#> 2             Vaud            Villa           2006-2010      <NA>
#> 3             Vaud        Apartment noteg     2016-2024      <NA>
#> 4             Vaud            Villa           1961-1970      <NA>
#> 5          Grisons        Apartment    eg     2016-2024      <NA>
#> 6          Grisons        Apartment noteg     2016-2024      <NA>
#> 7          Grisons        Apartment noteg     2011-2015      <NA>
#> 230           Vaud        Apartment    eg     2011-2015      <NA>
#> 1137        Geneva     Single house           2011-2015      <NA>
#> 1138        Geneva Bifamiliar house           2016-2024      <NA>
#> 1139        Geneva Bifamiliar house           1981-1990      <NA>
#> 5481        Valais       Attic flat noteg     2016-2024      <NA>
#> 5482        Valais        Apartment noteg     2016-2024      <NA>
#> 5483        Valais        Apartment noteg     2016-2024      <NA>
#> 5484        Valais        Apartment noteg     2016-2024      <NA>
#> 7624          Bern     Single house           2001-2005      <NA>
#> 7625          Bern            Villa           2016-2024      <NA>
#> 7626          Bern        Apartment noteg     1971-1980      <NA>
#> ...    (remaining rows truncated; the 77 affected listings span the
#>         cantons of Bern, Basel-Stadt, Aargau, Lucerne, Ticino, Grisons,
#>         Zurich, Schaffhausen and Thurgau)
#>       Canton_code lon lat
#> 1            <NA>  NA  NA
#> ...    (Canton_code, lon and lat are <NA> for all 77 listings)

In the data processing steps described, we identified 77 listings whose Community field was NA after attempting to merge the property listings with the AMTOVZ_CSV_LV95 dataset.

Reasons for NA occurrences:

  • Zip code not found in the AMTOVZ_CSV_LV95 dataset. This happens when a property’s zip code is absent from the reference data, possibly because the listing is outdated or incorrect.
  • Zip code incorrectly isolated from the address. Extraction errors occur when addresses are irregularly formatted.

Removing entries with NA values keeps the dataset accurate and reliable for further analysis, which is crucial in real estate market analysis where geographical accuracy is paramount. We therefore dropped them:

Click to show code
# Remove the rows where Community is NA
properties_filtered <- df[!is.na(df$Community),]
reactable(head(properties_filtered, 100))

2.1.4 Tax data

  • Source: https://swisstaxcalculator.estv.admin.ch/#/taxdata/tax-rates
  • The dataset lists, per municipality, the tax coefficients and rates used below: income tax, wealth tax, profit tax, and capital tax.
  • The raw CSV encodes a two-level header in its first rows, which the cleaning step below flattens.

2.1.4.1 Cleaning

Click to show code
# read csv
impots <- read.csv(file.path(here(),"data/estv_income_rates.csv"), sep = ",", header = TRUE, stringsAsFactors = FALSE)

# Remove 1st row
impots <- impots[-1, ]
# Remove 3rd column
impots <- impots[, -3]

# Combine text for columns 4-8
impots[1, 4:8] <- "Impôt sur le revenu"
# Combine text for columns 9-13
impots[1, 9:13] <- "Impôt sur la fortune"
# Combine text for columns 14-16
impots[1, 14:16] <- "Impôt sur le bénéfice"
# Combine text for columns 17-19
impots[1, 17:19] <- "Impôt sur le capital"

# Combine content of the first 2 rows into the 2nd row
impots[2, ] <- apply(impots[1:2, ], 2, function(x) paste(ifelse(is.na(x[1]), x[2], ifelse(is.na(x[2]), x[1], paste(x[1], x[2], sep = " ")))))

# Remove 1st row
impots <- impots[-1, ]

# Assign the text to the 1st row and 1st column
impots[1, 1] <- "Coefficient d'impôt en %"
# Replace column names with the content of the first row
colnames(impots) <- impots[1, ]
impots <- impots[-1, ]

# Check for missing values in impots
any_missing <- any(is.na(impots))

if (any_missing) {
  print("There are missing values in impots.")
} else {
  print("There are no missing values in impots.")
}
#> [1] "There are no missing values in impots."


# Replace row names with the content of the 3rd column
row.names(impots) <- impots[, 3]
impots <- impots[, -3]

# Remove 2nd column (the canton column)
impots <- impots[, -2]
# Remove church tax ("impôt ecclésiastique") columns
impots <- impots[, -c(4:6)]
impots <- impots[, -c(6:8)]
impots <- impots[, -8]
impots <- impots[, -10]
# Clean data and convert to numeric
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))

# Values that failed numeric conversion were coerced to NA; count them first
n_coerced <- sum(is.na(cleaned_impots))
if (n_coerced > 0) {
  print(paste("Note:", n_coerced, "values could not be parsed and were set to 0."))
}

# Replace NA values with 0
cleaned_impots[is.na(cleaned_impots)] <- 0

rownames(cleaned_impots) <- rownames(impots)

#reactable(head(cleaned_impots, 100))
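
The positional column drops above (`impots[, -3]`, `impots[, -c(4:6)]`, …) silently break if the exported CSV gains or loses a column. A more robust alternative is to drop columns by name pattern; a minimal sketch on a toy table (the column names here are illustrative, not the exact ESTV headers):

```r
# Toy stand-in for the cleaned tax table; names are illustrative only
impots_toy <- data.frame(
  Commune                       = c("Lausanne", "Bern"),
  `Impot sur le revenu`         = c(10.5, 12.1),
  `Impot sur la fortune`        = c(1.2, 1.5),
  `Impot paroissial protestant` = c(0.3, 0.4),  # church tax: drop
  `Impot paroissial catholique` = c(0.2, 0.3),  # church tax: drop
  check.names = FALSE
)

# Drop every church-tax column by name pattern instead of by position
impots_kept <- impots_toy[, !grepl("paroissial", colnames(impots_toy)), drop = FALSE]
colnames(impots_kept)
```

Selecting by name keeps the pipeline valid even if the source file reorders its columns.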

2.1.5 Commune Data

2.1.5.1 Cleaning

  • Source file: data/commune_data.csv, communal indicators for the year 2019.
  • The indicators cover population structure (inhabitants, density, foreigners, age groups, households), demographic rates (nuptiality, divorce, birth, death), the social assistance rate, and the party vote shares for the Conseil national.

We replace NAs in both the social assistance rate (“Taux d’aide sociale”) and the political indicators (Conseil national data). For the social assistance rate, NAs carried status “Q” (“not indicated, to protect confidentiality”); we replaced them with the average rate in Switzerland in 2019, which was 3.2%.

For the political data, NAs carried status “M” (“not indicated because the data was not important or applicable”); we therefore replaced them with 0.

Click to show code
# NOTE: this path may need to be changed
commune_prep <- read.csv(file.path(here(),"data/commune_data.csv"), sep = ";", header = TRUE, stringsAsFactors = FALSE)

# Keep only 2019 as the reference year (the 2020 data is incomplete)
commune_2019 <- subset(commune_prep, PERIOD_REF == "2019") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE", "STATUS"))

# Drop rows where STATUS is Q or M (no value), then drop the STATUS column
commune_2019 <- subset(commune_2019, STATUS == "A") %>%
  select(c("REGION", "CODE_REGION", "INDICATORS", "VALUE"))

# Remove rows that are aggregates (the whole of Switzerland)
commune_2019 <- subset(commune_2019, REGION != "Schweiz")

commune_2019 <- commune_2019 %>%
  pivot_wider(names_from = INDICATORS, values_from = VALUE)

# Rename columns using the provided map
commune <- commune_2019 %>%
  rename(`Population - Habitants` = Ind_01_01,
         `Population - Densité de la population` = Ind_01_03,
         `Population - Etrangers` = Ind_01_08,
         `Population - Part du groupe d'âge 0-19 ans` = Ind_01_04,
         `Population - Part du groupe d'âge 20-64 ans` = Ind_01_05,
         `Population - Part du groupe d'âge 65+ ans` = Ind_01_06,
         `Population - Taux brut de nuptialité` = Ind_01_09,
         `Population - Taux brut de divortialité` = Ind_01_10,
         `Population - Taux brut de natalité` = Ind_01_11,
         `Population - Taux brut de mortalité` = Ind_01_12,
         `Population - Ménages privés` = Ind_01_13,
         `Population - Taille moyenne des ménages` = Ind_01_14,
         `Sécurité sociale - Taux d'aide sociale` = Ind_11_01,
         `Conseil national - PLR` = Ind_14_01,
         `Conseil national - PDC` = Ind_14_02,
         `Conseil national - PS` = Ind_14_03,
         `Conseil national - UDC` = Ind_14_04,
         `Conseil national - PEV/PCS` = Ind_14_05,
         `Conseil national - PVL` = Ind_14_06,
         `Conseil national - PBD` = Ind_14_07,
         `Conseil national - PST/Sol.` = Ind_14_08,
         `Conseil national - PES` = Ind_14_09,
         `Conseil national - Petits partis de droite` = Ind_14_10)

# Party shares are NA when no one in the commune voted for that party; replace with 0
commune <- commune %>%
  mutate_at(vars(starts_with("Conseil national")), ~replace_na(., 0))


# Replace NAs in the social assistance rate column
# with the 2019 Swiss mean (3.2%)
mean_taux_aide_social <- 3.2

# Replace NA values with the mean
commune <- commune %>%
  mutate(`Sécurité sociale - Taux d'aide sociale` = if_else(is.na(`Sécurité sociale - Taux d'aide sociale`), mean_taux_aide_social, `Sécurité sociale - Taux d'aide sociale`))
# Show the first 100 rows of commune using reactable
reactable(head(commune, 100))

3 Unsupervised learning

  • Clustering and/or dimension reduction

To explore the relationship between real estate prices and external factors, we performed unsupervised clustering on three data sets covering every Swiss municipality: fiscal, demographic, and political. The resulting cluster labels are then included as features in our supervised models, since municipalities within a cluster behave roughly alike.

3.1 Fiscal clustering

First, we performed k-means clustering on the fiscal data set. The elbow method on the within-cluster sum of squares suggested 5 clusters.

Click to show code
# Clean data and convert to numeric
set.seed(123)
cleaned_impots <- apply(impots, 2, function(x) as.numeric(gsub("[^0-9.-]", "", x)))
cleaned_impots[is.na(cleaned_impots)] <- 0  # Replace NA values with 0

# Scale the features
scaled_impots <- scale(cleaned_impots)

# Perform k-means clustering
k <- 2  # Initial guess for the number of clusters
kmeans_model <- kmeans(scaled_impots, centers = k)

# Check within-cluster sum of squares (elbow method)
wss <- numeric(10)
for (i in 1:10) {
  kmeans_model <- kmeans(scaled_impots, centers = i)
  wss[i] <- sum(kmeans_model$withinss)
}
plot(1:10, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")

# Adjust k based on elbow method
k <- 5  

# Perform k-means clustering again with optimal k
kmeans_model <- kmeans(scaled_impots, centers = k)

# Extract the cluster labels
clusters <- kmeans_model$cluster

Click to show code
# Get the cluster centers
cluster_centers <- kmeans_model$centers

# Create a data frame with cluster centers
cluster_centers_df <- data.frame(cluster = 1:k, cluster_centers)

# Print cluster centers
# print(cluster_centers_df)

# Calculate the size of each cluster
cluster_sizes <- table(kmeans_model$cluster)

# Print cluster sizes
# print(cluster_sizes)

# Get the cluster labels
cluster_labels <- kmeans_model$cluster

# Convert cleaned_impots to a data frame
impots_cluster <- as.data.frame(cleaned_impots)

# Add the cluster labels to cleaned_impots
impots_cluster$cluster <- cluster_labels

rownames(impots_cluster) <- rownames(impots)

impots_cluster <- impots_cluster %>%
  rownames_to_column(var = "Community")
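
The elbow heuristic is somewhat subjective; the average silhouette width offers a complementary check of the chosen k. A minimal sketch on synthetic data (in practice `scaled_impots` would replace `x`; `cluster` is a recommended package bundled with R):

```r
library(cluster)  # for silhouette()

set.seed(123)
# Synthetic stand-in for the scaled tax matrix: two well-separated groups
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))

# Average silhouette width for a range of candidate k
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(x, centers = k, nstart = 10)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
})
names(avg_sil) <- 2:6
best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

On the real fiscal data this either confirms the elbow choice of 5 or flags that a different k fits better.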

3.2 Demographic clustering

Then, we performed hierarchical clustering on the demographic data. The data was first scaled (some features are percentages, others counts), the dissimilarity matrix was computed with the Minkowski distance, and Ward’s method was used for the linkage.

As the optimal number of clusters for the fiscal data set was determined to be 5, we continued the analysis of the two other data sets with 5 clusters as well, keeping the same scale (albeit categorical) for the three features resulting from the unsupervised clustering.

Click to show code
# Clustering demographic
cols_commune_demographic <- select(commune, -c("REGION", "CODE_REGION","Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

# Scale the columns, some are total numbers, some are percentages
cols_commune_demographic <- scale(cols_commune_demographic)

# Calculate the distance matrix
dist_matrix_demographic <- dist(cols_commune_demographic, method = "minkowski")

# Perform hierarchical clustering
hclust_model_demographic <- hclust(dist_matrix_demographic, method = "ward.D2")

# Create dendrogram
dend_demo <- as.dendrogram(hclust_model_demographic)
dend_demo <- color_branches(dend_demo, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

par(mar = c(0.001, 4, 4, 2) + 0.1)
plot(dend_demo, main = "Demographics - Hierarchical Clustering Dendrogram", xlab = "")


3.3 Political clustering

The same process was used for our political data set, whose features are the vote shares of each major party in the Conseil national elections. The only difference is that we did not scale the data, as all features are percentages.

Click to show code
# Clustering politics

cols_commune_politics <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))


# Calculate the distance matrix
dist_matrix_politics <- dist(cols_commune_politics, method = "minkowski")

# Perform hierarchical clustering
hclust_model_politics <- hclust(dist_matrix_politics, method = "ward.D2")

# Create dendrogram
dend_pol <- as.dendrogram(hclust_model_politics)
dend_pol <- color_branches(dend_pol, k = 5) #Set number of cluster to 5, to keep the same scale for all our variables

par(mar = c(0.001, 4, 4, 2) + 0.1)
plot(dend_pol, main = "Politics - Hierarchical Clustering Dendrogram")


Click to show code
# Preparing df_commune for merging with main dataset

df_commune <- select(commune, REGION)

df_commune$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)
df_commune$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Preparing to merge

merging <- inner_join(amto_df, df_commune, by = c("Community" = "REGION"))

impots_cluster_subset <- impots_cluster[, c("Community", "cluster")]
merging <- merging %>%
  left_join(impots_cluster_subset, by = "Community")

clusters_df <- merging %>%
  rename(Tax_cluster = cluster) %>%
  rename(Commune = Community)

clusters_df <- clusters_df %>%
  select(c("Commune", "zip_code", "Canton_code", "Demographic_cluster", "Political_cluster", "Tax_cluster"))

# The only NAs are for the commune Brugg (written "Brugg (AG)" in the other data set) -> assign its cluster manually
clusters_df$Tax_cluster[is.na(clusters_df$Tax_cluster)] <- 2

# adding it to our main data set:
properties_filtered <- merge(properties_filtered, clusters_df[, c("zip_code", "Demographic_cluster", "Political_cluster", "Tax_cluster")], by = "zip_code", all.x = TRUE)
Click to show code
# Dropping 228 rows containing NAs after the merge (Problem with names)

# Find rows with NA values in the specified columns
na_rows <- subset(properties_filtered, is.na(Demographic_cluster) | is.na(Political_cluster) | is.na(Tax_cluster))

# Drop the NA rows
properties_filtered <- anti_join(properties_filtered, na_rows, by = "zip_code")
Click to show code
# Interpretation of demographic clusters
demographic_vars <- select(commune, -c("REGION", "CODE_REGION", "Conseil national - PLR", "Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))


# Scale the variables
scaled_demographic_vars <- scale(demographic_vars)

# Convert to data frame
scaled_demographic_vars <- as.data.frame(scaled_demographic_vars)

# Add demographic cluster labels
scaled_demographic_vars$Demographic_cluster <- cutree(hclust_model_demographic, k = 5)

# Melt the dataset to long format
melted_demographic <- melt(scaled_demographic_vars, id.vars = "Demographic_cluster")

# Define color palette
my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Calculate the number of boxplots
num_boxplots <- length(unique(melted_demographic$variable))

# Calculate the number of rows and columns for the layout
num_rows <- ceiling(sqrt(num_boxplots))
num_cols <- ceiling(num_boxplots / num_rows)

# Set up the layout with reduced margins
par(mfrow = c(num_rows, num_cols), mar = c(5, 5, 1, 1))

# Create boxplots for each variable
for (variable in unique(melted_demographic$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_demographic$value[melted_demographic$variable == variable], melted_demographic$Demographic_cluster[melted_demographic$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  # Create boxplot with custom colors
  boxplot(value ~ Demographic_cluster, data = melted_demographic[melted_demographic$variable == variable,],
          main = variable,
          xlab = "",
          ylab = variable,
          ylim = c(ylim_min, ylim_max),
          col = my_colors)
}

The unsupervised clustering performed on the demographic data of Swiss municipalities returns some interesting results.

  • Cluster 1 seems to capture municipalities where many families with children live (“Part du groupe d’âge 0-19 ans” and “Taille moyenne des ménages” are high).

  • Clusters 2 and 3 are very similar, with most variables showing no particular pattern. Note, however, that municipalities in cluster 2 have a slightly higher population density than those in cluster 3, with more foreign inhabitants.

  • Cluster 4 seems to capture large cities (large, dense populations, with most inhabitants aged 20 to 64).

  • Cluster 5 seems to capture municipalities with an aging population (“Part du groupe d’âge 65+ ans” and “Taux brut de mortalité” are high).
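
The boxplot reading above can be condensed into a table of per-cluster mean profiles. A minimal sketch with a synthetic stand-in (the real inputs would be `scaled_demographic_vars` and its `Demographic_cluster` column; the indicator names here are placeholders):

```r
set.seed(42)
# Synthetic stand-in: two scaled indicators plus a cluster label
demo_toy <- data.frame(
  densite             = rnorm(30),
  age_65p             = rnorm(30),
  Demographic_cluster = rep(1:3, each = 10)
)

# One row per cluster, one column per indicator: the cluster's mean profile
profiles <- aggregate(cbind(densite, age_65p) ~ Demographic_cluster,
                      data = demo_toy, FUN = mean)
profiles
```

Reading high and low cells of such a table supports the same interpretation as scanning dozens of boxplots.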

Click to show code
# Subset your dataset to include only the variables used to create the political clusters and the political cluster labels
political_vars <- select(commune, c("Conseil national - PLR","Conseil national - PDC", "Conseil national - PS", "Conseil national - UDC", "Conseil national - PEV/PCS", "Conseil national - PVL", "Conseil national - PBD", "Conseil national - PST/Sol.", "Conseil national - PES", "Conseil national - Petits partis de droite"))

colnames(political_vars) <- sub("Conseil national - ", "", colnames(political_vars))

# Add political cluster labels
political_vars$Political_cluster <- cutree(hclust_model_politics, k = 5)

# Melt the dataset to long format
melted_political <- melt(political_vars, id.vars = "Political_cluster")

# Define pastel color palette
pastel_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(5, 2), mar = c(3, 3, 1, 1), pty = "s")

# Create boxplots for each variable
for (variable in unique(melted_political$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_political$value[melted_political$variable == variable], melted_political$Political_cluster[melted_political$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  # Create boxplot with pastel colors
  boxplot(value ~ Political_cluster, data = melted_political[melted_political$variable == variable,],
          main = variable,
          xlab = "Political Cluster",
          ylab = variable,
          ylim = c(ylim_min, ylim_max),
          col = pastel_colors)
}

The political clusters are harder to interpret than the demographic ones. Still, cluster 3 stands out with high values for the PS and PST (left-leaning parties) and low values for right-leaning parties, while cluster 2 shows the opposite pattern, with high values for right-leaning parties.

Click to show code
# Subset your dataset to include only the variables used to create the tax clusters and the tax cluster labels
tax_vars <- select(impots_cluster, -c("Community", "cluster"))

# Scale the variables
scaled_tax_vars <- scale(tax_vars)

# Convert to data frame
scaled_tax_vars <- as.data.frame(scaled_tax_vars)

# Add tax cluster labels
scaled_tax_vars$Tax_cluster <- impots_cluster$cluster

# Melt the dataset to long format
melted_tax <- melt(scaled_tax_vars, id.vars = "Tax_cluster")

my_colors <- c("#a6cee3", "#b2df8a", "#fb9a99", "#fdbf6f", "#cab2d6")

# Set up the layout with reduced margins
par(mfrow = c(3, 3), mar = c(5, 5, 2, 2))

# Create boxplots for each variable
for (variable in unique(melted_tax$variable)) {
  # Calculate quantiles for each combination of variable and cluster
  quantiles <- tapply(melted_tax$value[melted_tax$variable == variable], melted_tax$Tax_cluster[melted_tax$variable == variable], quantile, c(0.05, 0.95))
  
  # Determine ylim for each plot
  ylim_min <- min(unlist(quantiles))
  ylim_max <- max(unlist(quantiles))
  
  boxplot(value ~ Tax_cluster, data = melted_tax[melted_tax$variable == variable,],
          main = variable,
          xlab = "",
          ylab = variable,
          ylim = c(ylim_min, ylim_max),
          col = my_colors)
}

The fiscal clusters are also quite difficult to interpret. The most striking one is cluster 5, with very high cantonal taxes but only average communal (municipal) taxes.

4 EDA

4.1 Map representation of distribution of properties

Here we represent the distribution of properties in Switzerland on a map. The map is interactive: hovering over a marker shows the price. Markers are drawn in orange with a semi-transparent fill and a small radius to reduce visual noise and overplotting.

This visualization helps in understanding the geographical spread and density of properties in the dataset.

Click to show code
# Create a leaflet map with optimized markers
map <- leaflet(properties_filtered) %>%
  addTiles() %>%  # Add default OpenStreetMap tiles
  addProviderTiles(providers$Esri.NatGeoWorldMap) %>% # Add topographic maps for context
  addCircleMarkers(
    ~lon, ~lat,
    radius = 1.5,  # Smaller radius for the circle markers
    color = "#F97300",  # Specifying a color for the markers
    fillOpacity = 0.2,  # Semi-transparent fill
    stroke = FALSE,  # No border to the circle markers to reduce visual noise
    popup = ~paste("Price: ", price, "<br>",
                   "Rooms: ", number_of_rooms, "<br>",
                   "Type: ", property_type, "<br>",
                   "Year: ", year_category),
    label = ~paste("Price: ", price)  # Tooltip on hover
  ) %>% addLegend(
    position = "bottomright",  # Position the legend at the bottom right
    colors = "#F97300",  # Use the same color as the markers
    labels = "Properties"  # Label for the legend
  )

map$width <- "100%"  # Set the width of the map to 100%
map$height <- 600  # Set the height of the map to 600 pixels

map

The map shows that properties are well distributed across the country, with fewer listings in the Alpine regions, as expected given the mountainous terrain. We thus have good coverage of the data across cantons and locations, which we can use for further analysis.

4.2 Histogram of prices

Click to show code
histogram_price <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "red") +
  labs(title = "Distribution of Prices",
       x = "Price",
       y = "Frequency") +
  theme_minimal()
# Convert ggplot object to plotly object
interactive_histogram_price <- ggplotly(histogram_price, width = 600, height = 500 )
# Display the interactive histogram
interactive_histogram_price

As we can see, the majority of the properties are priced below 5 million CHF.
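
That visual impression can be quantified as the share of listings below the threshold; a sketch on synthetic log-normal prices (the real column is `properties_filtered$price`):

```r
set.seed(1)
# Synthetic right-skewed prices standing in for properties_filtered$price
price <- rlnorm(1000, meanlog = 13.5, sdlog = 0.6)  # median around CHF 730k

# Share of listings priced below 5 million
share_below_5m <- mean(price < 5e6)
share_below_5m
```

Reporting this share alongside the histogram makes the "majority below 5 million" claim precise.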

4.3 Prices between 0 and 5 million

To enhance data visibility, we focus on the bulk of the data between 0 and 5 million CHF, filtering out the outliers above.

4.3.1 Histogram of prices for each property type

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5000000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create the ggplot object
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ property_type, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Property Type",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5000000))

# Convert ggplot object to plotly object
interactive_histogram <- ggplotly(histogram, width = 750, height = 1000)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))

# Display the interactive plot
interactive_histogram

Upon first glance, the majority of property types are valued at less than 3 million, with apartments and single houses being the most frequent.

4.3.2 Histogram of prices for each year category

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5000000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create a histogram of prices for each year category
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ year_category, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Year Category",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5000000))

# Convert ggplot object to plotly object
interactive_histogram_year <- ggplotly(histogram, width = 750, height = 1000)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram_year <- layout(interactive_histogram_year, margin = list(l = 50, r = 50, b = 50, t = 50))
# Display the interactive plot
interactive_histogram_year

The majority of properties were built between 2016 and 2024. Interestingly, the distribution remains similar across almost all periods.

4.3.3 Histogram of prices for each canton

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ canton, scales = "free", ncol = 2) +
  labs(title = "Distribution by Canton for properties between 0 and 5 million",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.y = element_text(size = 2)) +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 1000)
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))
# Display the interactive plot
interactive_histogram

Compared to other cantons, Geneva shows a distinct distribution. The majority of the listed properties are concentrated in the cantons of Vaud, Valais, Ticino, Bern, and Fribourg.

4.3.4 Histogram of prices for each number of rooms

Click to show code
# Define breaks for x-axis in millions
breaks <- seq(0, 5200000, by = 1000000)
labels <- paste0(breaks/1000000, "Mio")

# Create the histogram
histogram <- ggplot(properties_filtered, aes(x = price)) +
  geom_histogram(binwidth = 100000, fill = "skyblue", color = "black") +
  facet_wrap(~ number_of_rooms, scales = "free", ncol = 2) +
  labs(title = "Distribution of Prices by Number of Rooms",
       x = "Price",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(breaks = breaks, labels = labels, limits = c(0, 5200000))

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 2000) 
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))
# Display the interactive plot
interactive_histogram

The majority of properties have fewer than 10 rooms, and the distribution tends to shift slightly towards higher prices as the number of rooms increases.

4.4 Histogram of properties by square meters

To improve readability, only properties with less than 1,000 square meters are shown.

Click to show code
histogram <- ggplot(properties_filtered, aes(x = square_meters)) +
  geom_histogram(binwidth = 15, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Properties by Square Meters",
       x = "Square Meters",
       y = "Frequency") +
  theme_minimal() +
  xlim(0,1000)

# Convert ggplot object to plotly object with adjusted height
interactive_histogram <- ggplotly(histogram, width = 750, height = 500 )  # Adjust width and height as needed
# Adjust margins to prevent x-axis label from being cut off
interactive_histogram <- layout(interactive_histogram, margin = list(l = 50, r = 50, b = 50, t = 50))
# Display the interactive plot
interactive_histogram

4.5 Boxplot of prices by Tax Cluster

Click to show code

# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Tax_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Tax Cluster",
       x = "Tax Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500)

# Display the interactive plot
interactive_boxplot

4.6 Boxplot of prices by Political Cluster

Click to show code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Political_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Political Cluster",
       x = "Political Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500 )
interactive_boxplot

4.7 Boxplot of prices by Demographic Cluster

Click to show code
# Create the boxplot
boxplot <- ggplot(properties_filtered, aes(x = as.factor(Demographic_cluster), y = price)) +
  geom_boxplot(fill = "skyblue", color = "black") +
  labs(title = "Boxplot of Property Prices by Demographic Cluster",
       x = "Demographic Cluster",
       y = "Price") +
  theme_minimal() +
  scale_y_continuous(labels = scales::unit_format(unit = "Mio", scale = 1e-6), limits = c(0, 4000000))

# Convert ggplot object to plotly object
interactive_boxplot <- ggplotly(boxplot, width = 750, height = 500 )
interactive_boxplot

5 Supervised learning

  • Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
  • Two or more models
  • Two or more scores
  • Tuning of one or more hyperparameters per model
  • Interpretation of the model(s)

5.1 Model 1

This section provides a comprehensive outline of the linear regression model and analysis methods employed in the study of property price determinants.

5.1.1 Tools and Software

The study was conducted using R, a programming language and environment widely recognized for its robust capabilities in statistical computing and graphics. Key packages used include:

  • corrplot for visualization of correlation matrices, aiding in preliminary feature selection.
  • car for diagnostic testing and variance inflation factor (VIF) analysis to detect multicollinearity among predictors.
  • caret for creating training and testing sets, and managing cross-validation processes.
  • ggplot2 and plotly for creating visual representations of model diagnostics, predictions, and residuals.
  • gtsummary for creating publication-ready tables summarizing regression analysis results.

Each of these tools was chosen for its specific functionality that aids in robust data analysis, ensuring that each step of the model building and evaluation process is well-supported.

5.1.2 Linear Regression - With All Prices

5.1.2.1 Correlation Analysis - Continuous Variable

Initially, a correlation analysis was conducted to identify and visualize linear relationships between the property prices and potential predictive variables such as the number of rooms and square meters. The correlation matrix was computed and plotted using the corrplot package:

Click to show code
correlation_matrix <- cor(properties_filtered[ , c("price", "number_of_rooms", "square_meters")], use="complete.obs")
corrplot(correlation_matrix, method="square", type="upper", tl.col="black", tl.srt=45, tl.cex=0.8, cl.cex=0.8, addCoef.col="black", number.cex=0.8, order="hclust", hclust.method="complete", tl.pos="lt", tl.offset=0.5, cl.pos="n", cl.length=1.5)

The correlation between the number of rooms and price is weak (0.02), while square meters and price are strongly correlated (0.68). The number of rooms and square meters are themselves only weakly correlated (0.12), indicating that they contribute largely independent information to the model.

Note that Pearson correlation is only defined for numeric variables; the association between price and the categorical predictors (property type, canton, year category) is therefore assessed within the regression models below. A variance inflation factor (VIF) analysis can additionally be used to check for multicollinearity among the predictors.
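One hedged way to quantify how strongly a categorical variable is associated with price is the eta-squared statistic from a one-way ANOVA. The sketch below uses a small synthetic data frame; the column names `price` and `canton` mirror the real data, but the values are purely illustrative:

```r
# Eta-squared: share of price variance explained by a categorical predictor.
# Synthetic stand-in data (illustrative values only, not the scraped data set).
set.seed(1)
toy <- data.frame(
  canton = rep(c("VD", "GE", "VS"), each = 50),
  price  = c(rnorm(50, 9e5, 2e5), rnorm(50, 1.4e6, 3e5), rnorm(50, 6e5, 1.5e5))
)
fit <- aov(price ~ canton, data = toy)
ss <- summary(fit)[[1]][["Sum Sq"]]  # between-group SS, residual SS
eta_sq <- ss[1] / sum(ss)            # proportion of variance explained
```

An eta-squared near 0 indicates no association; values near 1 indicate the categorical variable explains most of the price variance.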

Basic Model

Model Building and Evaluation

The dataset was split into training and testing sets to validate the model’s performance. The linear regression model was then fitted using selected predictors:

Click to show code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category + Demographic_cluster + Political_cluster + Tax_cluster, data = trainData)

summary(lm_model)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster, data = trainData)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -18897993   -358366   -103746    195296  16784689 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                    -280169      51178   -5.47  4.5e-08
#> number_of_rooms                  25198       6341    3.97  7.1e-05
#> square_meters                     8494        106   79.81  < 2e-16
#> property_typeAttic flat         129021      41445    3.11  0.00185
#> property_typeBifamiliar house  -188611      39382   -4.79  1.7e-06
#> property_typeChalet             143905      49537    2.91  0.00368
#> property_typeDuplex             -96168      50397   -1.91  0.05638
#> property_typeFarm house        -436410     114539   -3.81  0.00014
#> property_typeLoft              -144419     261540   -0.55  0.58083
#> property_typeRoof flat          -74019      58645   -1.26  0.20691
#> property_typeRustic house      -109998     231519   -0.48  0.63471
#> property_typeSingle house      -135237      23261   -5.81  6.2e-09
#> property_typeTerrace flat       -54410      83645   -0.65  0.51539
#> property_typeVilla              192125      36913    5.20  2.0e-07
#> flooreg                          20745      23779    0.87  0.38300
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945          236188      54611    4.32  1.5e-05
#> year_category1946-1960          294589      51365    5.74  9.9e-09
#> year_category1961-1970          239390      43517    5.50  3.8e-08
#> year_category1971-1980          315114      38792    8.12  4.9e-16
#> year_category1981-1990          297688      39214    7.59  3.3e-14
#> year_category1991-2000          355533      40758    8.72  < 2e-16
#> year_category2001-2005          498869      49211   10.14  < 2e-16
#> year_category2006-2010          560413      43331   12.93  < 2e-16
#> year_category2011-2015          592744      42111   14.08  < 2e-16
#> year_category2016-2024          453204      33141   13.68  < 2e-16
#> Demographic_cluster              73605       6927   10.63  < 2e-16
#> Political_cluster               -52929       5179  -10.22  < 2e-16
#> Tax_cluster                     -73874       7312  -10.10  < 2e-16
#>                                  
#> (Intercept)                   ***
#> number_of_rooms               ***
#> square_meters                 ***
#> property_typeAttic flat       ** 
#> property_typeBifamiliar house ***
#> property_typeChalet           ** 
#> property_typeDuplex           .  
#> property_typeFarm house       ***
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat        
#> property_typeVilla            ***
#> flooreg                          
#> floornoteg                       
#> year_category1919-1945        ***
#> year_category1946-1960        ***
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Demographic_cluster           ***
#> Political_cluster             ***
#> Tax_cluster                   ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 976000 on 16449 degrees of freedom
#> Multiple R-squared:  0.476,  Adjusted R-squared:  0.475 
#> F-statistic:  553 on 27 and 16449 DF,  p-value: <2e-16


Diagnostic checks such as residual analysis and normality tests were conducted to validate model assumptions. Performance metrics including R-squared and RMSE were calculated to assess the model’s explanatory power and prediction accuracy.
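The diagnostic checks mentioned above can be sketched in base R. This is a minimal illustration on synthetic data; `shapiro.test` stands in for the Anderson-Darling test (`ad.test`) used in the commented-out code, which requires the nortest package:

```r
# Minimal residual diagnostics for a linear model (synthetic data).
set.seed(1)
d <- data.frame(x = runif(200))
d$y <- 2 * d$x + rnorm(200)
fit <- lm(y ~ x, data = d)
res <- residuals(fit)

normality_p <- shapiro.test(res)$p.value  # H0: residuals are normally distributed
rmse        <- sqrt(mean(res^2))          # in-sample RMSE
r_squared   <- summary(fit)$r.squared     # explanatory power
```

A low normality p-value or a large gap between in-sample and out-of-sample RMSE would flag violated assumptions or overfitting, respectively.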

Click to show code
sum(is.na(testData$price))  # Check NAs in price
#> [1] 0
sapply(testData, function(x) sum(is.na(x)))  # Check NAs in all predictors
#>            zip_code               price     number_of_rooms 
#>                   0                   0                   0 
#>       square_meters             address              canton 
#>                   0                   0                   0 
#>       property_type               floor       year_category 
#>                   0                   0                   0 
#>           Community         Canton_code                 lon 
#>                   0                   0                   0 
#>                 lat Demographic_cluster   Political_cluster 
#>                   0                   0                   0 
#>         Tax_cluster 
#>                   0
Click to show code
# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1 <- gtsummary::tbl_regression(lm_model)
tbl_reg_1
Characteristic Beta 95% CI1 p-value
number_of_rooms 25,198 12,769, 37,626 <0.001
square_meters 8,494 8,285, 8,703 <0.001
property_type


    Apartment
    Attic flat 129,020 47,784, 210,257 0.002
    Bifamiliar house -188,611 -265,804, -111,418 <0.001
    Chalet 143,905 46,808, 241,002 0.004
    Duplex -96,168 -194,952, 2,617 0.056
    Farm house -436,410 -660,919, -211,902 <0.001
    Loft -144,419 -657,067, 368,228 0.6
    Roof flat -74,019 -188,970, 40,932 0.2
    Rustic house -109,998 -563,801, 343,804 0.6
    Single house -135,237 -180,832, -89,643 <0.001
    Terrace flat -54,410 -218,363, 109,543 0.5
    Villa 192,125 119,772, 264,479 <0.001
floor


    floor
    eg 20,745 -25,864, 67,355 0.4
    noteg


year_category


    0-1919
    1919-1945 236,188 129,145, 343,232 <0.001
    1946-1960 294,589 193,908, 395,269 <0.001
    1961-1970 239,390 154,092, 324,687 <0.001
    1971-1980 315,114 239,077, 391,150 <0.001
    1981-1990 297,688 220,824, 374,552 <0.001
    1991-2000 355,533 275,643, 435,424 <0.001
    2001-2005 498,869 402,410, 595,329 <0.001
    2006-2010 560,413 475,481, 645,346 <0.001
    2011-2015 592,744 510,202, 675,286 <0.001
    2016-2024 453,204 388,244, 518,164 <0.001
Demographic_cluster 73,605 60,028, 87,182 <0.001
Political_cluster -52,929 -63,080, -42,778 <0.001
Tax_cluster -73,874 -88,207, -59,542 <0.001
1 CI = Confidence Interval
5.1.2.1.1 Assess Overfitting
Click to show code
# For the Linear Model
lm_train_pred <- predict(lm_model, newdata = trainData)
lm_test_pred <- predict(lm_model, newdata = testData)

# Calculate RMSE and R-squared for Training Data
lm_train_rmse <- sqrt(mean((trainData$price - lm_train_pred)^2))
lm_train_rsquared <- summary(lm(lm_train_pred ~ trainData$price))$r.squared

# Calculate RMSE and R-squared for Test Data
lm_test_rmse <- sqrt(mean((testData$price - lm_test_pred)^2))
lm_test_rsquared <- summary(lm(lm_test_pred ~ testData$price))$r.squared

# show the results in a table
results_table <- data.frame(
  Model = c("Linear Regression"),
  RMSE_Train = lm_train_rmse,
  RMSE_Test = lm_test_rmse,
  Rsquared_Train = lm_train_rsquared,
  Rsquared_Test = lm_test_rsquared
)
results_table
#>               Model RMSE_Train RMSE_Test Rsquared_Train
#> 1 Linear Regression     975651    964081          0.476
#>   Rsquared_Test
#> 1         0.499
5.1.2.1.2 Metrics

Here are the performance metrics for the initial model:

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model)$r.squared
adj_r_sq <- summary(lm_model)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model, newdata=testData))
mae <- mae(testData$price, predict(lm_model, newdata=testData))
aic <- AIC(lm_model)
# show those metrics in a html table
metrics_1 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Basic Model Performance Metrics" = 5)) 
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964081 488516 501282


5.1.2.2 Hyperparameter Tuning

5.1.2.2.1 Stepwise Regression

Stepwise regression was performed to refine the model and improve its predictive performance. The resulting model was evaluated using the same diagnostic checks and performance metrics as the initial model:
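Base R's `step()` performs this AIC-based selection; a minimal sketch on synthetic data (illustrative variable names, not the project's predictors):

```r
# Backward stepwise selection by AIC on synthetic data.
set.seed(42)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300), x3 = rnorm(300))
d$y <- 1.5 * d$x1 + 0.8 * d$x2 + rnorm(300)  # x3 carries no signal
full    <- lm(y ~ x1 + x2 + x3, data = d)
reduced <- step(full, trace = 0)  # drops terms whose removal lowers AIC
names(coef(reduced))              # informative predictors are retained
```

By default `step()` considers both dropping and re-adding terms (`direction = "both"`), which is how `floor` was eliminated from the model above.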

Click to show code
# stepwise regression
lm_model2 <- step(lm_model)
#> Start:  AIC=454520
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Demographic_cluster + Political_cluster + 
#>     Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> - floor                1  7.26e+11 1.57e+16 454519
#> <none>                             1.57e+16 454520
#> - number_of_rooms      1  1.51e+13 1.57e+16 454534
#> - Tax_cluster          1  9.73e+13 1.58e+16 454620
#> - Political_cluster    1  9.96e+13 1.58e+16 454622
#> - Demographic_cluster  1  1.08e+14 1.58e+16 454631
#> - property_type       10  1.30e+14 1.58e+16 454636
#> - year_category       10  2.99e+14 1.60e+16 454812
#> - square_meters        1  6.07e+15 2.18e+16 459911
#> 
#> Step:  AIC=454519
#> price ~ number_of_rooms + square_meters + property_type + year_category + 
#>     Demographic_cluster + Political_cluster + Tax_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> <none>                             1.57e+16 454519
#> - number_of_rooms      1  1.51e+13 1.57e+16 454533
#> - Tax_cluster          1  9.72e+13 1.58e+16 454619
#> - Political_cluster    1  9.98e+13 1.58e+16 454621
#> - Demographic_cluster  1  1.08e+14 1.58e+16 454630
#> - property_type       11  1.47e+14 1.58e+16 454650
#> - year_category       10  3.02e+14 1.60e+16 454813
#> - square_meters        1  6.07e+15 2.18e+16 459909

#use gt summary to show the result 
tbl_reg_2 <- gtsummary::tbl_regression(lm_model2)
tbl_reg_3 <- tbl_merge(
  tbls= list(tbl_reg_1, tbl_reg_2),
  tab_spanner = c("**Model 1**", "**Model Stepwise**")
  )
tbl_reg_3
Characteristic Model 1 Model Stepwise
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 25,198 12,769, 37,626 <0.001 25,241 12,813, 37,669 <0.001
square_meters 8,494 8,285, 8,703 <0.001 8,493 8,285, 8,702 <0.001
property_type





    Apartment

    Attic flat 129,020 47,784, 210,257 0.002 124,293 43,754, 204,832 0.002
    Bifamiliar house -188,611 -265,804, -111,418 <0.001 -193,462 -269,881, -117,043 <0.001
    Chalet 143,905 46,808, 241,002 0.004 139,484 42,897, 236,071 0.005
    Duplex -96,168 -194,952, 2,617 0.056 -96,396 -195,178, 2,387 0.056
    Farm house -436,410 -660,919, -211,902 <0.001 -440,487 -664,807, -216,167 <0.001
    Loft -144,419 -657,067, 368,228 0.6 -143,544 -656,184, 369,096 0.6
    Roof flat -74,019 -188,970, 40,932 0.2 -78,755 -193,212, 35,702 0.2
    Rustic house -109,998 -563,801, 343,804 0.6 -113,914 -567,627, 339,800 0.6
    Single house -135,237 -180,832, -89,643 <0.001 -139,564 -184,110, -95,018 <0.001
    Terrace flat -54,410 -218,363, 109,543 0.5 -52,403 -216,292, 111,487 0.5
    Villa 192,125 119,772, 264,479 <0.001 187,713 116,043, 259,384 <0.001
floor





    floor



    eg 20,745 -25,864, 67,355 0.4


    noteg





year_category





    0-1919

    1919-1945 236,188 129,145, 343,232 <0.001 236,459 129,418, 343,501 <0.001
    1946-1960 294,589 193,908, 395,269 <0.001 294,540 193,860, 395,220 <0.001
    1961-1970 239,390 154,092, 324,687 <0.001 239,026 153,733, 324,318 <0.001
    1971-1980 315,114 239,077, 391,150 <0.001 314,454 238,432, 390,475 <0.001
    1981-1990 297,688 220,824, 374,552 <0.001 297,736 220,872, 374,599 <0.001
    1991-2000 355,533 275,643, 435,424 <0.001 355,709 275,821, 435,598 <0.001
    2001-2005 498,869 402,410, 595,329 <0.001 499,368 402,916, 595,820 <0.001
    2006-2010 560,413 475,481, 645,346 <0.001 560,818 475,891, 645,745 <0.001
    2011-2015 592,744 510,202, 675,286 <0.001 593,261 510,728, 675,794 <0.001
    2016-2024 453,204 388,244, 518,164 <0.001 455,027 390,197, 519,857 <0.001
Demographic_cluster 73,605 60,028, 87,182 <0.001 73,690 60,114, 87,266 <0.001
Political_cluster -52,929 -63,080, -42,778 <0.001 -52,977 -63,127, -42,826 <0.001
Tax_cluster -73,874 -88,207, -59,542 <0.001 -73,842 -88,174, -59,510 <0.001
1 CI = Confidence Interval
5.1.2.2.2 Lasso and Ridge Regression

Lasso and Ridge regressions were also performed to compare the linear regression model against regularization techniques. Both models were fit using cross-validation to determine the optimal lambda (penalty parameter), and the plots show the lambda selection process for each model.
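Note: the "NAs introduced by coercion" warnings in the output below arise because `as.matrix()` on a data frame with factor or character columns (here `floor` and `year_category`) produces a character matrix, which glmnet then coerces to NA. A hedged sketch of the usual fix, using base R's `model.matrix()` to one-hot-encode categorical predictors into a fully numeric design matrix (synthetic data for illustration):

```r
# Build a numeric design matrix from mixed-type predictors (synthetic data).
toy <- data.frame(
  square_meters = c(80, 120, 95, 210),
  year_category = factor(c("0-1919", "1991-2000", "2016-2024", "1991-2000"))
)
x <- model.matrix(~ square_meters + year_category, data = toy)[, -1]  # drop intercept
storage.mode(x)  # "double": dummy columns are numeric, so glmnet gets no NAs
```

Passing such a matrix to `cv.glmnet()` would avoid the coercion warnings and let the categorical predictors contribute to the penalized fits.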

Click to show code
# Convert data frames to matrices for glmnet
dat_tr_re_mat_x <- as.matrix(trainData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_tr_re_mat_y <- trainData$price

dat_te_re_mat_x <- as.matrix(testData[, c("number_of_rooms", "square_meters", "floor", "year_category", "Demographic_cluster", "Political_cluster", "Tax_cluster")])
dat_te_re_mat_y <- testData$price

#fit lasso and ridge
set.seed(123)  # For reproducibility

# Fit Lasso model
lasso_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 1) # Lasso
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion
#> (warning repeated for each cross-validation fold)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#> (warning repeated for each cross-validation fold)

# Fit Ridge model
ridge_model <- cv.glmnet(dat_tr_re_mat_x, dat_tr_re_mat_y, alpha = 0) # Ridge
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion
#> (warning repeated for each cross-validation fold)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#> (warning repeated for each cross-validation fold)

ridge_fit_best <- glmnet(x=dat_tr_re_mat_x, y = dat_tr_re_mat_y, 
                         lambda = ridge_model$lambda.min)
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion

lasso_fit_best <- glmnet(x=dat_tr_re_mat_x, y=dat_tr_re_mat_y, 
                         lambda = lasso_model$lambda.min) #can also use lasso_fit$lambda.1se
#> Warning in storage.mode(xd) <- "double": NAs introduced by coercion

# lasso & ridge performance on the training set
postResample(predict(ridge_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 1.00e+06 4.50e-01 5.09e+05
postResample(predict(lasso_fit_best, newx = dat_tr_re_mat_x), dat_tr_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.93e+05 4.57e-01 5.00e+05
# lasso & ridge performance on the test set
postResample(predict(ridge_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.96e+05 4.76e-01 5.00e+05
postResample(predict(lasso_fit_best, newx = dat_te_re_mat_x), dat_te_re_mat_y)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
#>     RMSE Rsquared      MAE 
#> 9.82e+05 4.82e-01 4.90e+05

# Step-wise lm performance on training and test sets
postResample(predict(lm_model2,trainData), dat_tr_re_mat_y)
#>     RMSE Rsquared      MAE 
#> 9.76e+05 4.76e-01 4.97e+05
postResample(predict(lm_model2,testData), dat_te_re_mat_y)
#>     RMSE Rsquared      MAE 
#> 9.64e+05 4.99e-01 4.89e+05

We then compared the performance of the Lasso and Ridge models with the stepwise linear regression model using RMSE, MAE, and R-squared:

Click to show code
# Calculate RMSE, MAE, and R-squared
get_metrics <- function(predictions, actuals) {
  RMSE <- sqrt(mean((predictions - actuals)^2))
  MAE <- mean(abs(predictions - actuals))
  Rsquared <- cor(predictions, actuals)^2
  return(c(RMSE = RMSE, MAE = MAE, Rsquared = Rsquared))
}

# Capture the performance metrics
ridge_train_preds <- predict(ridge_fit_best, newx = dat_tr_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lasso_train_preds <- predict(lasso_fit_best, newx = dat_tr_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
ridge_test_preds <- predict(ridge_fit_best, newx = dat_te_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lasso_test_preds <- predict(lasso_fit_best, newx = dat_te_re_mat_x)
#> Warning in cbind2(1, newx) %*% nbeta: NAs introduced by coercion
lm_train_preds <- predict(lm_model2, trainData)
lm_test_preds <- predict(lm_model2, testData)

ridge_train_metrics <- get_metrics(ridge_train_preds, dat_tr_re_mat_y)
lasso_train_metrics <- get_metrics(lasso_train_preds, dat_tr_re_mat_y)
ridge_test_metrics <- get_metrics(ridge_test_preds, dat_te_re_mat_y)
lasso_test_metrics <- get_metrics(lasso_test_preds, dat_te_re_mat_y)
lm_train_metrics <- get_metrics(lm_train_preds, dat_tr_re_mat_y)
lm_test_metrics <- get_metrics(lm_test_preds, dat_te_re_mat_y)

# Create a data frame with the performance metrics
performance_df <- data.frame(
  Model = c("Ridge (Training)", "Lasso (Training)", "Ridge (Test)", "Lasso (Test)", "Stepwise (Training)", "Stepwise (Test)"),
  RMSE = c(ridge_train_metrics["RMSE"], lasso_train_metrics["RMSE"], ridge_test_metrics["RMSE"], lasso_test_metrics["RMSE"], lm_train_metrics["RMSE"], lm_test_metrics["RMSE"]),
  MAE = c(ridge_train_metrics["MAE"], lasso_train_metrics["MAE"], ridge_test_metrics["MAE"], lasso_test_metrics["MAE"], lm_train_metrics["MAE"], lm_test_metrics["MAE"]),
  Rsquared = c(ridge_train_metrics["Rsquared"], lasso_train_metrics["Rsquared"], ridge_test_metrics["Rsquared"], lasso_test_metrics["Rsquared"], lm_train_metrics["Rsquared"], lm_test_metrics["Rsquared"])
)

# Create the kable extra table
performance_table <- kable(performance_df, format = "html") %>%
  kable_styling(full_width = FALSE, position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c( "Performance Metrics (RMSE, MAE, R-sq)" = 4))

# Print the table
performance_table
Performance Metrics (RMSE, MAE, R-sq)
Model RMSE MAE Rsquared
Ridge (Training) 1003013 508981 0.450
Lasso (Training) 993240 500232 0.457
Ridge (Test) 995990 500158 0.476
Lasso (Test) 981526 490425 0.482
Stepwise (Training) 975674 496834 0.476
Stepwise (Test) 964156 488684 0.499

The stepwise model is thus the best model, as it has the lowest RMSE and MAE values on both the training and test sets. Among the regularized models, we observe that Lasso outperforms Ridge in terms of RMSE and R-squared, indicating that Lasso regularization may be more effective in this context.

5.1.2.2.3 Metrics

Here we compare the performance metrics of the initial model and the stepwise model. The metrics of our initial model:

Click to show code
metrics_1
Basic Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964081 488516 501282


Stepwise model:

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model2)$r.squared
adj_r_sq <- summary(lm_model2)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2, newdata=testData))
mae <- mae(testData$price, predict(lm_model2, newdata=testData))
aic <- AIC(lm_model2)
# show those metrics in a html table
metrics_2 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("Stepwise Model Performance Metrics" = 5)) 
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964156 488684 501280


5.1.2.3 Cross-Validation

Cross-validation was used to assess the model’s robustness, primarily to avoid overfitting and to ensure that the model generalizes well to new data; the caret package was used to manage this process efficiently. The CV results report metrics such as RMSE and R-squared across folds, indicating how stable the model’s performance is across different subsets of the data.
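The idea behind k-fold cross-validation can be sketched in base R without caret; `train()` below automates the same loop. Synthetic data stand in for the property data:

```r
# Manual 10-fold cross-validation for a linear model (synthetic data).
set.seed(123)
d <- data.frame(x = runif(200))
d$y <- 3 * d$x + rnorm(200)

k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))  # random fold assignment
rmse_per_fold <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])        # train on k-1 folds
  pred <- predict(fit, newdata = d[fold == i, ])  # predict the held-out fold
  sqrt(mean((d$y[fold == i] - pred)^2))
})
cv_rmse <- mean(rmse_per_fold)  # average out-of-fold RMSE
```

Because every observation is held out exactly once, the averaged RMSE estimates out-of-sample error more stably than a single train/test split.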

10-fold cross-validation metrics:

Click to show code
# Demographic_cluster, Political_cluster and Tax_cluster could be added once missing values are handled
## Cross-Validation
cv_results <- train(price ~ number_of_rooms + square_meters + year_category + property_type, data = trainData, method = "lm", trControl = trainControl(method = "cv", number = 10))
summary(cv_results)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -18710178   -362677   -115403    196960  16761951 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                      -458336      40280  -11.38  < 2e-16
#> number_of_rooms                    25064       6334    3.96  7.6e-05
#> square_meters                       8510        106   80.33  < 2e-16
#> `year_category1919-1945`          233395      55004    4.24  2.2e-05
#> `year_category1946-1960`          284299      51735    5.50  4.0e-08
#> `year_category1961-1970`          221613      43818    5.06  4.3e-07
#> `year_category1971-1980`          304335      39063    7.79  7.0e-15
#> `year_category1981-1990`          278800      39462    7.06  1.7e-12
#> `year_category1991-2000`          331674      40977    8.09  6.2e-16
#> `year_category2001-2005`          479359      49496    9.68  < 2e-16
#> `year_category2006-2010`          533222      43585   12.23  < 2e-16
#> `year_category2011-2015`          568439      42372   13.42  < 2e-16
#> `year_category2016-2024`          415840      33203   12.52  < 2e-16
#> `property_typeAttic flat`         115759      41377    2.80  0.00515
#> `property_typeBifamiliar house`  -198123      39137   -5.06  4.2e-07
#> property_typeChalet               173521      49288    3.52  0.00043
#> property_typeDuplex               -93790      50718   -1.85  0.06444
#> `property_typeFarm house`        -487279     115093   -4.23  2.3e-05
#> property_typeLoft                 -93595     263370   -0.36  0.72231
#> `property_typeRoof flat`          -63989      58694   -1.09  0.27564
#> `property_typeRustic house`      -118736     233077   -0.51  0.61046
#> `property_typeSingle house`      -143963      22845   -6.30  3.0e-10
#> `property_typeTerrace flat`       -32966      84163   -0.39  0.69529
#> property_typeVilla                155989      36692    4.25  2.1e-05
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                 ***
#> square_meters                   ***
#> `year_category1919-1945`        ***
#> `year_category1946-1960`        ***
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ** 
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex             .  
#> `property_typeFarm house`       ***
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`        
#> property_typeVilla              ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 984000 on 16453 degrees of freedom
#> Multiple R-squared:  0.468,  Adjusted R-squared:  0.467 
#> F-statistic:  629 on 23 and 16453 DF,  p-value: <2e-16
#show the CV result in the html
metrics_cv_1 <- kable(cv_results$results, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics" = 7))
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 983467 0.475 506256 127699 0.062 25120

:::

Here are the performance metrics for the stepwise model: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964156 488684 501280

:::

The cross-validated metrics closely track the stepwise model’s metrics (CV RMSE of 983,467 and R-squared of 0.475, versus 964,156 and 0.476 for the stepwise model), so the model appears robust: performance is consistent across folds and there is no sign of overfitting.

5.1.2.4 Model Testing

We chose the stepwise model (lm_model2): variable selection left its explanatory power essentially unchanged (adjusted R-squared of 0.475) while producing a more parsimonious specification, and the cross-validation above confirmed that this performance is stable across folds.

The final model was tested using the unseen test dataset to evaluate its predictive accuracy. Residual plots and actual vs. predicted price plots were created to visually assess model performance:

5.1.2.4.1 Residual vs Predicted Prices

This plot shows residuals (differences between observed and predicted prices) against predicted prices. Ideally, residuals should randomly scatter around the horizontal line at zero, indicating that the model doesn’t systematically overestimate or underestimate prices.

Click to show code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

As we can observe, the spread of residuals suggests potential issues with model fit, particularly for higher price predictions where the variance seems to increase.
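This increasing spread can be quantified rather than just eyeballed. Below is a minimal sketch of a Breusch-Pagan test for heteroscedasticity, assuming the lmtest package is installed; the auxiliary regression of squared residuals on predicted prices is an illustrative by-hand variant, reusing the test_residuals and predicted_prices columns created above.

```r
# Breusch-Pagan test on the chosen model: H0 is constant residual variance.
# A small p-value would confirm the funnel shape seen in the plot.
library(lmtest)
bptest(lm_model2)

# The same idea by hand on the test set: regress the squared residuals on the
# predicted prices and check whether the slope is positive and significant.
aux <- lm(I(test_residuals^2) ~ predicted_prices, data = testData)
summary(aux)$coefficients
```

A significant positive slope in the auxiliary regression would support the visual impression that prediction error grows with price.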

5.1.2.4.2 Actual vs Predicted Prices

This plot should ideally show points along the diagonal line, where actual prices equal predicted prices. The data clustering along the line suggests a generally good model fit, but here we can observe the spread which indicates variance in predictions, especially at higher price points. ::: {.cell layout-align=“center”}

Click to show code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

:::

5.1.3 Linear Regression between 10th and 90th percentile

To address the increased variance at higher price points, we can restrict the data to a narrower price range. Here, we filter the data between the 10th and 90th percentiles of the price distribution to see whether the model performs better within this range:

Click to show code
#filter properties_filtered based on the 10th percentile and 90th percentile of the data
properties_filtered <- properties_filtered %>% 
  filter(price >= quantile(price, 0.1) & price <= quantile(price, 0.9))

5.1.3.1 Model Building and Evaluation

We then repeat the model building and evaluation process for this filtered dataset to compare the performance with the initial (best) model: ::: {.cell layout-align=“center”}

Click to show code
# Model Building
## Split the Data
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

## Fit the Model
lm_model_10_90 <- lm(price ~ number_of_rooms + square_meters + property_type + floor + year_category  + Political_cluster + Tax_cluster + Demographic_cluster, data=trainData)

summary(lm_model_10_90)
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Tax_cluster + 
#>     Demographic_cluster, data = trainData)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3809626  -256783   -83928   199375  1624676 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                               Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                   437960.9    23710.2   18.47  < 2e-16
#> number_of_rooms                14008.0     3241.7    4.32  1.6e-05
#> square_meters                   3232.6       70.3   46.01  < 2e-16
#> property_typeAttic flat       124835.0    17516.8    7.13  1.1e-12
#> property_typeBifamiliar house 121938.3    16384.0    7.44  1.0e-13
#> property_typeChalet            92514.8    23404.6    3.95  7.8e-05
#> property_typeDuplex            22394.2    20890.9    1.07  0.28376
#> property_typeFarm house       182737.0    49763.6    3.67  0.00024
#> property_typeLoft             -64361.7    98867.4   -0.65  0.51506
#> property_typeRoof flat          4933.8    25020.2    0.20  0.84368
#> property_typeRustic house       7936.9   220973.6    0.04  0.97135
#> property_typeSingle house      75866.0    10438.3    7.27  3.9e-13
#> property_typeTerrace flat      84107.8    33478.2    2.51  0.01201
#> property_typeVilla            154437.4    17023.5    9.07  < 2e-16
#> flooreg                        26124.1    10141.6    2.58  0.01001
#> floornoteg                          NA         NA      NA       NA
#> year_category1919-1945         47128.3    25037.5    1.88  0.05982
#> year_category1946-1960         64406.7    23775.1    2.71  0.00676
#> year_category1961-1970        164562.9    20564.8    8.00  1.3e-15
#> year_category1971-1980        158946.7    18239.9    8.71  < 2e-16
#> year_category1981-1990        166644.7    18132.8    9.19  < 2e-16
#> year_category1991-2000        166987.0    18628.6    8.96  < 2e-16
#> year_category2001-2005        314519.5    22387.9   14.05  < 2e-16
#> year_category2006-2010        352610.7    19730.0   17.87  < 2e-16
#> year_category2011-2015        332179.7    19185.5   17.31  < 2e-16
#> year_category2016-2024        239124.3    15491.6   15.44  < 2e-16
#> Political_cluster             -32103.5     2213.5  -14.50  < 2e-16
#> Tax_cluster                   -41058.1     3095.0  -13.27  < 2e-16
#> Demographic_cluster            40234.4     3076.0   13.08  < 2e-16
#>                                  
#> (Intercept)                   ***
#> number_of_rooms               ***
#> square_meters                 ***
#> property_typeAttic flat       ***
#> property_typeBifamiliar house ***
#> property_typeChalet           ***
#> property_typeDuplex              
#> property_typeFarm house       ***
#> property_typeLoft                
#> property_typeRoof flat           
#> property_typeRustic house        
#> property_typeSingle house     ***
#> property_typeTerrace flat     *  
#> property_typeVilla            ***
#> flooreg                       *  
#> floornoteg                       
#> year_category1919-1945        .  
#> year_category1946-1960        ** 
#> year_category1961-1970        ***
#> year_category1971-1980        ***
#> year_category1981-1990        ***
#> year_category1991-2000        ***
#> year_category2001-2005        ***
#> year_category2006-2010        ***
#> year_category2011-2015        ***
#> year_category2016-2024        ***
#> Political_cluster             ***
#> Tax_cluster                   ***
#> Demographic_cluster           ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 382000 on 13217 degrees of freedom
#> Multiple R-squared:  0.33,   Adjusted R-squared:  0.329 
#> F-statistic:  241 on 27 and 13217 DF,  p-value: <2e-16

# Model Evaluation
## Diagnostic Checks
#plot(lm_model)
#ad.test(residuals(lm_model))

#use gt summary to show the result
tbl_reg_1_10_90 <- gtsummary::tbl_regression(lm_model_10_90)

tbl_reg_3_vs_10_90 <- tbl_merge(
  tbls= list(tbl_reg_3, tbl_reg_1_10_90),
  tab_spanner = c("**Stepwise Model (All Prices)**", "**Basic Model (10-90 Qt)**")
  )
tbl_reg_3_vs_10_90
Characteristic Basic Model (All Prices) Stepwise Model (All Prices) Basic Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 25,198 12,769, 37,626 <0.001 25,241 12,813, 37,669 <0.001 14,008 7,654, 20,362 <0.001
square_meters 8,494 8,285, 8,703 <0.001 8,493 8,285, 8,702 <0.001 3,233 3,095, 3,370 <0.001
property_type








    Apartment


    Attic flat 129,020 47,784, 210,257 0.002 124,293 43,754, 204,832 0.002 124,835 90,500, 159,170 <0.001
    Bifamiliar house -188,611 -265,804, -111,418 <0.001 -193,462 -269,881, -117,043 <0.001 121,938 89,823, 154,053 <0.001
    Chalet 143,905 46,808, 241,002 0.004 139,484 42,897, 236,071 0.005 92,515 46,638, 138,391 <0.001
    Duplex -96,168 -194,952, 2,617 0.056 -96,396 -195,178, 2,387 0.056 22,394 -18,555, 63,343 0.3
    Farm house -436,410 -660,919, -211,902 <0.001 -440,487 -664,807, -216,167 <0.001 182,737 85,193, 280,281 <0.001
    Loft -144,419 -657,067, 368,228 0.6 -143,544 -656,184, 369,096 0.6 -64,362 -258,156, 129,433 0.5
    Roof flat -74,019 -188,970, 40,932 0.2 -78,755 -193,212, 35,702 0.2 4,934 -44,109, 53,977 0.8
    Rustic house -109,998 -563,801, 343,804 0.6 -113,914 -567,627, 339,800 0.6 7,937 -425,203, 441,077 >0.9
    Single house -135,237 -180,832, -89,643 <0.001 -139,564 -184,110, -95,018 <0.001 75,866 55,405, 96,327 <0.001
    Terrace flat -54,410 -218,363, 109,543 0.5 -52,403 -216,292, 111,487 0.5 84,108 18,486, 149,730 0.012
    Villa 192,125 119,772, 264,479 <0.001 187,713 116,043, 259,384 <0.001 154,437 121,069, 187,806 <0.001
floor








    floor




    eg 20,745 -25,864, 67,355 0.4


26,124 6,245, 46,003 0.010
    noteg








year_category








    0-1919


    1919-1945 236,188 129,145, 343,232 <0.001 236,459 129,418, 343,501 <0.001 47,128 -1,949, 96,205 0.060
    1946-1960 294,589 193,908, 395,269 <0.001 294,540 193,860, 395,220 <0.001 64,407 17,804, 111,009 0.007
    1961-1970 239,390 154,092, 324,687 <0.001 239,026 153,733, 324,318 <0.001 164,563 124,253, 204,873 <0.001
    1971-1980 315,114 239,077, 391,150 <0.001 314,454 238,432, 390,475 <0.001 158,947 123,194, 194,700 <0.001
    1981-1990 297,688 220,824, 374,552 <0.001 297,736 220,872, 374,599 <0.001 166,645 131,102, 202,188 <0.001
    1991-2000 355,533 275,643, 435,424 <0.001 355,709 275,821, 435,598 <0.001 166,987 130,472, 203,502 <0.001
    2001-2005 498,869 402,410, 595,329 <0.001 499,368 402,916, 595,820 <0.001 314,519 270,636, 358,403 <0.001
    2006-2010 560,413 475,481, 645,346 <0.001 560,818 475,891, 645,745 <0.001 352,611 313,937, 391,284 <0.001
    2011-2015 592,744 510,202, 675,286 <0.001 593,261 510,728, 675,794 <0.001 332,180 294,573, 369,786 <0.001
    2016-2024 453,204 388,244, 518,164 <0.001 455,027 390,197, 519,857 <0.001 239,124 208,759, 269,490 <0.001
Demographic_cluster 73,605 60,028, 87,182 <0.001 73,690 60,114, 87,266 <0.001 40,234 34,205, 46,264 <0.001
Political_cluster -52,929 -63,080, -42,778 <0.001 -52,977 -63,127, -42,826 <0.001 -32,103 -36,442, -27,765 <0.001
Tax_cluster -73,874 -88,207, -59,542 <0.001 -73,842 -88,174, -59,510 <0.001 -41,058 -47,125, -34,992 <0.001
1 CI = Confidence Interval

:::

5.1.3.1.1 Metrics

Here are the performance metrics for the model with prices between the 10th and 90th percentiles: ::: {.cell layout-align=“center”}

Click to show code
# print R-squared and Adj R-squared and RMSE and MAE and AIC
r_sq <- summary(lm_model_10_90)$r.squared
adj_r_sq <- summary(lm_model_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model_10_90, newdata=testData))
aic <- AIC(lm_model_10_90)
# show those metrics in a html table
metrics_1_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Basic Model Performance Metrics (10-90 Qt)" = 5))  
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.33 0.329 531600 291884 378076

:::

For comparison, here are the metrics of the stepwise model without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964156 488684 501280

:::

5.1.3.2 Variable Selection

Click to show code
# stepwise regression
lm_model2_10_90 <- step(lm_model_10_90)
#> Start:  AIC=340487
#> price ~ number_of_rooms + square_meters + property_type + floor + 
#>     year_category + Political_cluster + Tax_cluster + Demographic_cluster
#> 
#>                       Df Sum of Sq      RSS    AIC
#> <none>                             1.93e+15 340487
#> - floor                1  9.67e+11 1.93e+15 340491
#> - number_of_rooms      1  2.72e+12 1.93e+15 340503
#> - property_type       10  1.25e+13 1.94e+15 340552
#> - Demographic_cluster  1  2.49e+13 1.95e+15 340655
#> - Tax_cluster          1  2.56e+13 1.95e+15 340660
#> - Political_cluster    1  3.06e+13 1.96e+15 340694
#> - year_category       10  8.56e+13 2.01e+15 341043
#> - square_meters        1  3.08e+14 2.23e+15 342452
# plot(lm_model2)
# ad.test(residuals(lm_model2))

lm_model2_10_90
#> 
#> Call:
#> lm(formula = price ~ number_of_rooms + square_meters + property_type + 
#>     floor + year_category + Political_cluster + Tax_cluster + 
#>     Demographic_cluster, data = trainData)
#> 
#> Coefficients:
#>                   (Intercept)                number_of_rooms  
#>                        437961                          14008  
#>                 square_meters        property_typeAttic flat  
#>                          3233                         124835  
#> property_typeBifamiliar house            property_typeChalet  
#>                        121938                          92515  
#>           property_typeDuplex        property_typeFarm house  
#>                         22394                         182737  
#>             property_typeLoft         property_typeRoof flat  
#>                        -64362                           4934  
#>     property_typeRustic house      property_typeSingle house  
#>                          7937                          75866  
#>     property_typeTerrace flat             property_typeVilla  
#>                         84108                         154437  
#>                       flooreg                     floornoteg  
#>                         26124                             NA  
#>        year_category1919-1945         year_category1946-1960  
#>                         47128                          64407  
#>        year_category1961-1970         year_category1971-1980  
#>                        164563                         158947  
#>        year_category1981-1990         year_category1991-2000  
#>                        166645                         166987  
#>        year_category2001-2005         year_category2006-2010  
#>                        314519                         352611  
#>        year_category2011-2015         year_category2016-2024  
#>                        332180                         239124  
#>             Political_cluster                    Tax_cluster  
#>                        -32103                         -41058  
#>           Demographic_cluster  
#>                         40234

#use gt summary to show the result 
tbl_reg_2_10_90 <- gtsummary::tbl_regression(lm_model2_10_90)
tbl_reg_3_10_90 <- tbl_merge(
  tbls= list(tbl_reg_1_10_90, tbl_reg_2_10_90),
  tab_spanner = c("**Basic Model (10-90 Qt)**", "**Stepwise Model (10-90 Qt)**")
  )
tbl_reg_3_10_90
Characteristic Basic Model (10-90 Qt) Stepwise Model (10-90 Qt)
Beta 95% CI1 p-value Beta 95% CI1 p-value
number_of_rooms 14,008 7,654, 20,362 <0.001 14,008 7,654, 20,362 <0.001
square_meters 3,233 3,095, 3,370 <0.001 3,233 3,095, 3,370 <0.001
property_type





    Apartment

    Attic flat 124,835 90,500, 159,170 <0.001 124,835 90,500, 159,170 <0.001
    Bifamiliar house 121,938 89,823, 154,053 <0.001 121,938 89,823, 154,053 <0.001
    Chalet 92,515 46,638, 138,391 <0.001 92,515 46,638, 138,391 <0.001
    Duplex 22,394 -18,555, 63,343 0.3 22,394 -18,555, 63,343 0.3
    Farm house 182,737 85,193, 280,281 <0.001 182,737 85,193, 280,281 <0.001
    Loft -64,362 -258,156, 129,433 0.5 -64,362 -258,156, 129,433 0.5
    Roof flat 4,934 -44,109, 53,977 0.8 4,934 -44,109, 53,977 0.8
    Rustic house 7,937 -425,203, 441,077 >0.9 7,937 -425,203, 441,077 >0.9
    Single house 75,866 55,405, 96,327 <0.001 75,866 55,405, 96,327 <0.001
    Terrace flat 84,108 18,486, 149,730 0.012 84,108 18,486, 149,730 0.012
    Villa 154,437 121,069, 187,806 <0.001 154,437 121,069, 187,806 <0.001
floor





    floor

    eg 26,124 6,245, 46,003 0.010 26,124 6,245, 46,003 0.010
    noteg





year_category





    0-1919

    1919-1945 47,128 -1,949, 96,205 0.060 47,128 -1,949, 96,205 0.060
    1946-1960 64,407 17,804, 111,009 0.007 64,407 17,804, 111,009 0.007
    1961-1970 164,563 124,253, 204,873 <0.001 164,563 124,253, 204,873 <0.001
    1971-1980 158,947 123,194, 194,700 <0.001 158,947 123,194, 194,700 <0.001
    1981-1990 166,645 131,102, 202,188 <0.001 166,645 131,102, 202,188 <0.001
    1991-2000 166,987 130,472, 203,502 <0.001 166,987 130,472, 203,502 <0.001
    2001-2005 314,519 270,636, 358,403 <0.001 314,519 270,636, 358,403 <0.001
    2006-2010 352,611 313,937, 391,284 <0.001 352,611 313,937, 391,284 <0.001
    2011-2015 332,180 294,573, 369,786 <0.001 332,180 294,573, 369,786 <0.001
    2016-2024 239,124 208,759, 269,490 <0.001 239,124 208,759, 269,490 <0.001
Political_cluster -32,103 -36,442, -27,765 <0.001 -32,103 -36,442, -27,765 <0.001
Tax_cluster -41,058 -47,125, -34,992 <0.001 -41,058 -47,125, -34,992 <0.001
Demographic_cluster 40,234 34,205, 46,264 <0.001 40,234 34,205, 46,264 <0.001
1 CI = Confidence Interval
5.1.3.2.1 Metrics

Here are the performance metrics for the stepwise model with prices between the 10th and 90th percentiles as well as the comparison with the initial model: ::: {.cell layout-align=“center”}

Click to show code
## Performance Metrics
r_sq <- summary(lm_model2_10_90)$r.squared
adj_r_sq <- summary(lm_model2_10_90)$adj.r.squared
rmse <- rmse(testData$price, predict(lm_model2_10_90, newdata = testData))
mae <- mae(testData$price, predict(lm_model2_10_90, newdata=testData))
aic <- AIC(lm_model2_10_90)
# show those metrics in a html table
metrics_2_10_90 <- kable(data.frame(r_sq, adj_r_sq, rmse, mae, aic), format = "html", col.names = c("R-Squared", "Adjusted R-Squared", "RMSE", "MAE", "AIC")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Stepwise Model Performance Metrics (10-90 Qt)" = 5))
metrics_2_10_90
Stepwise Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.33 0.329 531600 291884 378076

:::

For comparison, here are the metrics of the basic model before stepwise selection: ::: {.cell layout-align=“center”}

Click to show code
metrics_1_10_90
Basic Model Performance Metrics (10-90 Qt)
R-Squared Adjusted R-Squared RMSE MAE AIC
0.33 0.329 531600 291884 378076

:::

And here are the metrics of the stepwise model without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964156 488684 501280

:::

5.1.3.3 Cross-Validation

Click to show code
## Cross-Validation
cv_results2 <- train(price ~ number_of_rooms + square_meters + year_category + property_type, data=trainData, method="lm", trControl=trainControl(method="cv", number=10))
summary(cv_results2)
#> 
#> Call:
#> lm(formula = .outcome ~ ., data = dat)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -3780853  -265807   -92952   203954  1605843 
#> 
#> Coefficients:
#>                                 Estimate Std. Error t value Pr(>|t|)
#> (Intercept)                     339511.6    20122.3   16.87  < 2e-16
#> number_of_rooms                  13722.4     3263.3    4.21  2.6e-05
#> square_meters                     3188.0       70.5   45.20  < 2e-16
#> `year_category1919-1945`         45021.5    25441.5    1.77   0.0768
#> `year_category1946-1960`         60596.7    24157.5    2.51   0.0121
#> `year_category1961-1970`        158331.7    20893.4    7.58  3.7e-14
#> `year_category1971-1980`        153564.3    18527.9    8.29  < 2e-16
#> `year_category1981-1990`        158894.6    18422.1    8.63  < 2e-16
#> `year_category1991-2000`        156792.6    18907.8    8.29  < 2e-16
#> `year_category2001-2005`        304972.5    22729.3   13.42  < 2e-16
#> `year_category2006-2010`        336133.4    20026.6   16.78  < 2e-16
#> `year_category2011-2015`        316724.0    19475.4   16.26  < 2e-16
#> `year_category2016-2024`        219396.1    15677.8   13.99  < 2e-16
#> `property_typeAttic flat`       112262.6    17627.7    6.37  2.0e-10
#> `property_typeBifamiliar house` 118815.5    16405.6    7.24  4.7e-13
#> property_typeChalet             100036.5    23552.3    4.25  2.2e-05
#> property_typeDuplex              20502.5    21189.4    0.97   0.3333
#> `property_typeFarm house`       152275.1    50417.0    3.02   0.0025
#> property_typeLoft               -48313.3   100443.0   -0.48   0.6305
#> `property_typeRoof flat`         10763.2    25251.9    0.43   0.6699
#> `property_typeRustic house`     -27078.8   224539.8   -0.12   0.9040
#> `property_typeSingle house`      70341.2    10342.6    6.80  1.1e-11
#> `property_typeTerrace flat`     107479.0    33974.6    3.16   0.0016
#> property_typeVilla              124013.8    17069.0    7.27  3.9e-13
#>                                    
#> (Intercept)                     ***
#> number_of_rooms                 ***
#> square_meters                   ***
#> `year_category1919-1945`        .  
#> `year_category1946-1960`        *  
#> `year_category1961-1970`        ***
#> `year_category1971-1980`        ***
#> `year_category1981-1990`        ***
#> `year_category1991-2000`        ***
#> `year_category2001-2005`        ***
#> `year_category2006-2010`        ***
#> `year_category2011-2015`        ***
#> `year_category2016-2024`        ***
#> `property_typeAttic flat`       ***
#> `property_typeBifamiliar house` ***
#> property_typeChalet             ***
#> property_typeDuplex                
#> `property_typeFarm house`       ** 
#> property_typeLoft                  
#> `property_typeRoof flat`           
#> `property_typeRustic house`        
#> `property_typeSingle house`     ***
#> `property_typeTerrace flat`     ** 
#> property_typeVilla              ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 388000 on 13221 degrees of freedom
#> Multiple R-squared:  0.308,  Adjusted R-squared:  0.307 
#> F-statistic:  256 on 23 and 13221 DF,  p-value: <2e-16

#show the CV result in the html
metrics_cv2 <- kable(cv_results2$results, format = "html") %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
  add_header_above(c("10 Fold Cross Validation Metrics (10-90th Qt)" = 7))
metrics_cv2
10 Fold Cross Validation Metrics (10-90th Qt)
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 388423 0.307 3e+05 10361 0.032 6108

For comparison, here are the cross-validation metrics of the model without the 10-90 Qt filter: ::: {.cell layout-align=“center”}

Click to show code
metrics_cv_1
10 Fold Cross Validation Metrics
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
TRUE 983467 0.475 506256 127699 0.062 25120

:::

5.1.3.4 Model Testing

5.1.3.4.1 Residual vs Predicted Prices
Click to show code
# Model Testing 
## Test the Model
predicted_prices <- predict(lm_model2_10_90, newdata=testData)
testData$predicted_prices <- predicted_prices  # Add to testData to ensure alignment
# Calculate residuals
testData$test_residuals <- testData$price - predicted_prices  # Manually compute residuals

# Residual Analysis
gg <- ggplot(data = testData, aes(x = predicted_prices, y = test_residuals)) +
  geom_point() +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Residuals vs Predicted Prices", x = "Predicted Prices", y = "Residuals")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p
5.1.3.4.2 Actual vs Predicted Prices
Click to show code
## Visualization
gg <- ggplot(data=testData, aes(x=predicted_prices, y=price)) +
    geom_point() +
    geom_smooth(method="lm", col="blue") +
    labs(title="Actual vs Predicted Prices", x="Predicted Prices", y="Actual Prices")

# Convert ggplot to plotly
p <- ggplotly(gg, width = 600, height = 400)

# Show the interactive plot
p

5.2 Model 2

5.2.1 Random Forest

We decided to implement a Random Forest model to compare its performance with the linear regression. Random Forest is an ensemble learning method that builds many decision trees during training and outputs the mode of the classes (for classification) or the mean prediction of the individual trees (for regression, as here). It is known for its robustness and its ability to capture complex, non-linear relationships in the data.

5.2.1.1 Model Building and Evaluation

We split the data into training and testing sets, fit the Random Forest model and then evaluated the model using performance metrics such as R-squared, RMSE, and MAE to assess its predictive accuracy and explanatory power. ::: {.cell layout-align=“center”}

Click to show code
#split the data in training and test set 0.8/0.2
set.seed(123)  # for reproducibility
trainIndex <- createDataPartition(properties_filtered$price, p=0.8, list=FALSE)
trainData <- properties_filtered[trainIndex, ]
testData <- properties_filtered[-trainIndex, ]

#apply the RF model as a regression
rf_model <- randomForest(price ~., data=trainData, ntree=500, importance=TRUE)
rf_model.pred_rf <- predict(rf_model, newdata=testData)


rmse_rf <- sqrt(mean((testData$price - rf_model.pred_rf)^2))
mae_rf <- mean(abs(testData$price - rf_model.pred_rf))
r_sq_rf <- cor(testData$price, rf_model.pred_rf)^2

# show those metrics in a html table
metrics_rf <- kable(data.frame(r_sq_rf, rmse_rf, mae_rf), format = "html", col.names = c("R-Squared", "RMSE", "MAE")) %>%
  kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
  add_header_above(c("Random Forest Model Performance Metrics" = 3))
metrics_rf
Random Forest Model Performance Metrics
R-Squared RMSE MAE
0.683 264181 195532

::: Compared with the best linear regression model, the Random Forest has a higher R-squared and lower RMSE and MAE, indicating better predictive accuracy. (Note that the Random Forest was fitted on the percentile-filtered data, so the stepwise model on the 10-90 Qt subset, with an RMSE of 531,600, is the most direct benchmark; the forest outperforms it as well.) ::: {.cell layout-align=“center”}

Click to show code
metrics_2
Stepwise Model Performance Metrics
R-Squared Adjusted R-Squared RMSE MAE AIC
0.476 0.475 964156 488684 501280

:::

The plot shows the actual vs. predicted prices, with the diagonal line indicating perfect predictions. ::: {.cell layout-align=“center”}

Click to show code
plot(testData$price ~ rf_model.pred_rf, col = 'skyblue',
     xlab = 'Predicted Price', ylab = 'Actual Price',
     main = 'Actual vs Predicted Price')

abline(0, 1, col = 'darkred')

:::
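The “mean prediction of the individual trees” described above can be checked directly: predict.randomForest exposes the per-tree predictions via its predict.all argument. A short sketch, reusing the rf_model and testData objects fitted above:

```r
# predict.all = TRUE returns, besides the aggregate forest prediction,
# an n_obs x ntree matrix with each individual tree's prediction.
pred_all <- predict(rf_model, newdata = testData, predict.all = TRUE)

# For regression, the forest's prediction is the row-wise mean of the
# per-tree predictions; this comparison should confirm they agree.
all.equal(pred_all$aggregate, rowMeans(pred_all$individual))
```

The per-tree matrix can also be used to attach rough uncertainty bands to each prediction, e.g. via the row-wise standard deviation.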

5.2.1.2 Variable Importance

Variable importance (VI) plots are a useful tool for understanding the relative importance of predictors in the Random Forest model. The plot below shows the importance of each predictor, helping to identify the key features that drive price predictions. ::: {.cell layout-align=“center”}

Click to show code
varImpPlot(rf_model)
importance(rf_model)
#>                     %IncMSE IncNodePurity
#> zip_code               82.4      2.55e+14
#> number_of_rooms        34.8      2.17e+14
#> square_meters         178.7      9.14e+14
#> address                57.7      1.70e+14
#> canton                 52.8      9.04e+13
#> property_type          32.6      8.24e+13
#> floor                  22.6      7.68e+13
#> year_category         131.1      2.22e+14
#> Community              82.1      1.26e+14
#> Canton_code            37.7      6.02e+13
#> lon                    73.2      2.15e+14
#> lat                    99.5      2.01e+14
#> Demographic_cluster    70.7      6.87e+13
#> Political_cluster      47.8      3.94e+13
#> Tax_cluster            35.7      3.27e+13

::: We see that square_meters is by far the most important predictor (highest %IncMSE), followed by year_category and the location variables (lat, zip_code, Community).
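Importance scores say which variables matter, but not how. A partial dependence plot shows the marginal effect of the top predictor on the forest’s predictions; a sketch using randomForest’s built-in partialPlot on the training data used above:

```r
# Marginal effect of square_meters on the predicted price, averaging over
# the other predictors in the training data.
partialPlot(rf_model, pred.data = as.data.frame(trainData),
            x.var = "square_meters",
            main = "Partial dependence of price on square_meters")
```

A roughly monotone increasing curve here would corroborate the positive square_meters coefficients found in the linear models.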

5.2.1.3 Cross-Validation

Cross-validating the Random Forest follows the same caret workflow as for the linear models; the chunk below is kept commented out, as even 5-fold CV over a 500-tree forest on this dataset is computationally heavy.

Click to show code
# cv_results_rf <- train(price ~., data=trainData, method="rf", trControl=trainControl(method="cv", number=5))
# summary(cv_results_rf)
# 
# #show the CV result in the html
# metrics_cv_rf <- kable(cv_results_rf$results, format = "html") %>%
#   kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed"))  %>%
#   add_header_above(c("5 Fold Cross Validation Metrics (Random Forest)" = 7))
# metrics_cv_rf

5.2.1.4 Hyperparameter Tuning

A grid search over mtry, combined with 5-fold cross-validation, could further improve the forest; this chunk is likewise kept commented out.

Click to show code
# # Define the tuning grid
# tuneGrid <- expand.grid(mtry = seq(2, sqrt(ncol(trainData)), by = 1))  # Tune over a range of mtry values
# 
# # Train the model with tuning
# rf_tuned <- train(price ~ ., data = trainData, method = "rf", 
#                   trControl = trainControl(method = "cv", number = 5, search = "grid"), 
#                   tuneGrid = tuneGrid, 
#                   ntree = 1000)
# 
# # Plotting the tuning effect
# plot(rf_tuned)
# 
# # Evaluate the tuned model
# rf_model_pred <- predict(rf_tuned, newdata = testData)
# rmse_rf <- sqrt(mean((testData$price - rf_model_pred)^2))
# mae_rf <- mean(abs(testData$price - rf_model_pred))
# r_sq_rf <- cor(testData$price, rf_model_pred)^2
# 
# # Show metrics
# metrics_rf <- kable(data.frame(R_Squared = r_sq_rf, RMSE = rmse_rf, MAE = mae_rf),
#                     format = "html", col.names = c("R-Squared", "RMSE", "MAE")) %>%
#              kable_styling(position = "center", bootstrap_options = c("striped", "bordered", "hover", "condensed")) %>%
#              add_header_above(c("Tuned Random Forest Model Performance Metrics" = 3))
# metrics_rf

6 Conclusion

This project set out to predict Swiss real estate prices using data scraped from ImmoScout24. Linear regression provided an interpretable baseline (adjusted R-squared of 0.475 for the stepwise model), but its residuals showed increasing variance at higher price points, and restricting the data to the 10th-90th price percentiles lowered the absolute errors at the cost of explanatory power (R-squared of 0.33). The Random Forest clearly outperformed the linear specifications (test R-squared of 0.683, RMSE of 264,181 and MAE of 195,532), with square_meters by far the most important predictor.

The take-home message is that ensemble methods capture the non-linear structure of this market considerably better than linear models, while the linear models remain valuable for interpreting individual effects such as construction period and property type. The main limitations are the heteroscedasticity of prices, the missing values in the cluster variables, and the fact that cross-validation and hyperparameter tuning of the Random Forest were left commented out. Future work should address exactly these points: tuning mtry and ntree, cross-validating the forest, handling the NAs in the demographic, political and tax clusters, and enriching the scraped data with further location-based features.